Asp Forum - Grabbing data off a webpage

Bil Kleb

12/17/2006 3:43:00 PM

OK, so I haven't done this in years.

What's the "modern" way of grabbing the data off
a webpage, e.g.,

http://yorkcountyschools.org/mves/arlist...

My initial attempt has been focused on Hpricot,

require 'rubygems'
require 'open-uri'
require 'hpricot'
doc = Hpricot(open('http://yorkcountyschools.org/mves/arlist...'))

and I can find doc/"th" and doc/"tr", but what's
the best way to cram them into an array of structs
or something?

Thanks,
--
Bil Kleb
http://funit.rub...

12 Answers

Gregory Brown

12/17/2006 4:23:00 PM

On 12/17/06, Bil Kleb <Bil.Kleb@nasa.gov> wrote:
> OK, so I haven't done this in years.
>
> What's the "modern" way of grabbing the data off
> a webpage, e.g.,
>
> http://yorkcountyschools.org/mves/arlist...
>
> My initial attempt has been focused on Hpricot,
>
> require 'rubygems'
> require 'open-uri'
> require 'hpricot'
> doc = Hpricot(open('http://yorkcountyschools.org/mves/arlist...'))
>
> and I can find doc/"th" and doc/"tr", but what's
> the best way to cram them into an array of structs
> or something?

I've actually been needing to do something like this for work and
haven't gotten around to it, so I'll take a stab at it.

require "ruport"
column_names = (doc/"th")[1..-1].map { |r| (r/"p").text }
rows = (doc/"tr")[3..-1]
parsed_rows = rows.inject([]) { |s,a|
  s << (a/"td").map { |r| r.inner_text }
}
table = parsed_rows.to_table(column_names)

Now, I've pastied some of the things you can do from here, because
they won't translate to email well.

http://pastie.cabo...

Note that my Hpricot code is somewhat hackish and could use a cleanup,
but Ruport[0] may still be a good choice for representing
the data.
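
For readers without Ruport installed, the step of pairing column names with parsed rows can be sketched in plain Ruby. This is only an illustration using the standard library, not Ruport's API; the column names and row values are invented placeholders rather than data from the actual page.

```ruby
# Sketch: turn header names plus rows of cell text into an array of hashes.
# Both arrays below are placeholder data, assumed for illustration.
column_names = %w[id title author book_level points]
parsed_rows = [
  ["1", "Sample Title A", "Author A", "3.3", "0.5"],
  ["2", "Sample Title B", "Author B", "3.4", "1.0"]
]

# zip pairs each column name with the cell in the same position.
table = parsed_rows.map { |row| column_names.zip(row).to_h }

table.each { |row| puts "#{row['title']} (#{row['points']} points)" }
```

Each row can then be looked up by column name, which is the minimum a table-like structure needs before reaching for a reporting library.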

Hope this helps!

-greg

[0] http://ruport.in...

Gregory Brown

12/17/2006 4:26:00 PM

On 12/17/06, Gregory Brown <gregory.t.brown@gmail.com> wrote:
> On 12/17/06, Bil Kleb <Bil.Kleb@nasa.gov> wrote:
> > OK, so I haven't done this in years.
> >
> > What's the "modern" way of grabbing the data off
> > a webpage, e.g.,
> >
> > http://yorkcountyschools.org/mves/arlist...
> >
> > My initial attempt has been focused on Hpricot,
> >
> > require 'rubygems'
> > require 'open-uri'
> > require 'hpricot'
> > doc = Hpricot(open('http://yorkcountyschools.org/mves/arlist...'))
> >
> > and I can find doc/"th" and doc/"tr", but what's
> > the best way to cram them into an array of structs
> > or something?
>
> I've actually been needing to do something like this for work and
> haven't gotten around to it, so I'll take a stab at it.
>
> require "ruport"
> column_names = (doc/"th")[1..-1].map { |r| (r/"p").text }
> rows = (doc/"tr")[3..-1]
> parsed_rows = rows.inject { |s,a|
> s << (a/"td").map { |r| (r/"td").text }
> }
> table = parsed_rows.to_table(column_names)
>
> Now, I've pastied some of the things you can do from here, because
> they wont translate to email well.
>
> http://pastie.cabo...

Yuck, seems to have made a mess of the text output.
Here it is better formatted:

http://pastie.caboo.se/...

Peter Szinek

12/17/2006 5:05:00 PM

Hi Bil,

How about:

require 'rubygems'
require 'open-uri'
require 'hpricot'
require 'enumerator'

Record = Struct.new("Record", :id, :title, :author, :book_level, :points)
records = []

cells =
  Hpricot(open('http://yorkcountyschools.org/mves/arlist/3-3....'))/"/html/body/table/tbody/tr//td"

cells.map { |elem| elem.inner_html }.each_slice(5) do |slice|
records << Record.new(*slice)
end
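
The slicing idea above can be tried without the network or Hpricot. The following is a minimal sketch that assumes the cells have already been extracted as a flat list of strings; the cell values are made-up placeholders and BookRecord is an illustrative name, not something from the page or the thread.

```ruby
# Sketch: group a flat list of cell strings into five-field records.
# The cell values below are invented placeholders, not real page data.
BookRecord = Struct.new(:id, :title, :author, :book_level, :points)

cells = [
  "1", "Sample Title A", "Author A", "3.3", "0.5",
  "2", "Sample Title B", "Author B", "3.4", "1.0"
]

# each_slice(5) yields consecutive groups of five cells, one per table row.
records = cells.each_slice(5).map { |slice| BookRecord.new(*slice) }

records.each { |r| puts "#{r.title} by #{r.author} (level #{r.book_level})" }
```

If the number of cells is not a multiple of five, the last slice is short and the missing fields end up as nil, so it is worth checking the cell count on a real page.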

HTH,
Peter

__
http://www.rubyra...

Gregory Brown

12/17/2006 6:11:00 PM

On 12/17/06, Peter Szinek <peter@rubyrailways.com> wrote:
> Hi Bill,
>
> How about:
>
> require 'rubygems'
> require 'open-uri'
> require 'hpricot'
> require 'enumerator'
>
> Record = Struct.new("Record", :id, :title, :author, :book_level, :points)
> records = []
>
> cells =
> Hpricot(open('http://yorkcountyschools.org/mves/arlist/3-3....))/"/html/body/table/tbody/tr//td"
>
>
> cells.map { |elem| elem.inner_html }.each_slice(5) do |slice|
> records << Record.new(*slice)
> end

Clever solution, Peter.

If you want to adapt this to use Ruport instead of a Struct (and get
the features I showed), try:

records = [].to_table([:id, :title, :author, :book_level, :points])

and then replace the appending code with

records << slice

This would allow struct-like, hash-like, and array-like access as well
as access to Ruport's data manipulation and formatting tools.

Peter Szinek

12/17/2006 6:19:00 PM

> This would allow struct-like, hash-like, and array-like access as well
> as access to Ruport's data manipulation and formatting tools.

Thx for the pointer Gregory, I did not know about Ruport yet - seems
very interesting, I will definitely check it out.

Cheers,
Peter

__
http://www.rubyra...

Gregory Brown

12/17/2006 6:27:00 PM

On 12/17/06, Peter Szinek <peter@rubyrailways.com> wrote:
> > This would allow struct-like, hash-like, and array-like access as well
> > as access to Ruport's data manipulation and formatting tools.
>
> Thx for the pointer Gregory, I did not know about Ruport yet - seems
> very interesting, I will definitely check it out.

It might be overkill if all you needed was struct-like access to your
data, but it would sure come in handy if you had some more complex
needs...
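
For the simple case, a plain Struct already covers all three access styles mentioned above. A small sketch, assuming the same five fields used earlier in the thread; BookEntry and its values are placeholders chosen for illustration.

```ruby
# Sketch: a plain Struct supports attribute, symbol-key, and index access.
# The field values are invented placeholders.
BookEntry = Struct.new(:id, :title, :author, :book_level, :points)
entry = BookEntry.new("1", "Sample Title A", "Author A", "3.3", "0.5")

puts entry.title     # struct-like: reader method
puts entry[:author]  # hash-like: lookup by member name
puts entry[3]        # array-like: lookup by position

p entry.to_a         # all values as an array
p entry.to_h         # member names mapped to values
```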

William James

12/17/2006 8:40:00 PM

Bil Kleb wrote:
> OK, so I haven't done this in years.
>
> What's the "modern" way of grabbing the data off
> a webpage, e.g.,
>
> http://yorkcountyschools.org/mves/arlist...
>
> My initial attempt has been focused on Hpricot,
>
> require 'rubygems'
> require 'open-uri'
> require 'hpricot'
> doc = Hpricot(open('http://yorkcountyschools.org/mves/arlist...'))
>
> and I can find doc/"th" and doc/"tr", but what's
> the best way to cram them into an array of structs
> or something?
>
> Thanks,
> --
> Bil Kleb
> http://funit.rub...

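require 'net/http'
http = Net::HTTP.new( "yorkcountyschools.org" )
# Net::HTTP#get returns a response object; the page text is its body.
data = http.get( "/mves/arlist/3-3.4.htm" ).body

# Capture each row, then each cell; keep only rows with exactly five cells.
table = data.scan( %r{<tr>(.*?)</tr}im ).flatten.
  map{|s| s.scan( %r{<td>(.*?)</td>}i ).flatten }.
  reject{|ary| ary.size != 5}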

p table
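
The same scan-based approach can be exercised offline against an inline string. This is a sketch only: the HTML below is a made-up stand-in for the real page, and it assumes bare <tr> and <td> tags with no attributes, which is what the patterns above require.

```ruby
# Sketch: regex-based table extraction on placeholder HTML (not the real page).
html = <<~HTML
  <table>
  <tr><th>Header row, no td cells</th></tr>
  <tr>
  <td>1</td><td>Sample Title A</td><td>Author A</td><td>3.3</td><td>0.5</td>
  </tr>
  <tr><td>2</td><td>Sample Title B</td><td>Author B</td><td>3.4</td><td>1.0</td></tr>
  <tr><td>stray</td></tr>
  </table>
HTML

# The m flag lets . match newlines, so a row may span several lines.
rows = html.scan(%r{<tr>(.*?)</tr>}im).flatten

# Pull the cell text out of each row and keep only complete five-cell rows.
table = rows.map { |row| row.scan(%r{<td>(.*?)</td>}i).flatten }
            .select { |cells| cells.size == 5 }

p table
```

Tags with attributes (for example a td with an align setting) would slip past these patterns, which is one reason a parser such as Hpricot tends to be more robust on real pages.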

Gregory Brown

12/18/2006 12:46:00 AM

On 12/17/06, William James <w_a_x_man@yahoo.com> wrote:

> require 'net/http'
> http = Net::HTTP.new( "yorkcountyschools.org" )
> resp, data = http.get( "/mves/arlist/3-3.4.htm", nil )

require "open-uri"
body = open("yorkcountyschools.org/mves/arlist/3-3.4.htm").read

Gregory Brown

12/18/2006 12:47:00 AM

On 12/17/06, Gregory Brown <gregory.t.brown@gmail.com> wrote:

> require "open-uri"
> body = open("yorkcountyschools.org/mves/arlist/3-3.4.htm").read

whoops... need the http://
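
Why the scheme matters can be seen with the standard uri library, without touching the network. A short sketch: without "http://", the string parses as a generic relative reference with no host, so open-uri has nothing to fetch over HTTP and the call falls back to opening a local file of that name.

```ruby
require "uri"

# Sketch: compare how the address parses with and without the scheme.
without_scheme = URI.parse("yorkcountyschools.org/mves/arlist/3-3.4.htm")
with_scheme    = URI.parse("http://yorkcountyschools.org/mves/arlist/3-3.4.htm")

# No scheme: a generic URI whose whole text is treated as a path.
p without_scheme.class, without_scheme.scheme, without_scheme.host

# With scheme: an HTTP URI with a proper host and path.
p with_scheme.class, with_scheme.scheme, with_scheme.host, with_scheme.path
```

On Ruby 3.0 and later, open-uri no longer hooks Kernel#open, so the fetch itself would be written as URI.open("http://...").read.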

Bil Kleb

12/18/2006 1:14:00 PM

Bil Kleb wrote:
>
> My initial attempt has been focused on Hpricot,
> [..] I can find doc/"th" and doc/"tr", but what's
> the best way to cram them into an array of structs
> or something?

Thanks everyone; I'm on my way now.

Regards,
--
Bil Kleb
http://fun3d.lar...

comp.lang.ruby
