Asp Forum - Scan HTML - comp.lang.ruby

Tom Arra

3/1/2008 3:22:00 AM

So I am new to Ruby scripting so I am not sure if this is possible or
not. I want to make a script that will load a webpage and then search
through the HTML of that page until it hits a certain tag. Once it hits
that tag it needs to grab all of the text between the tag and the
appropriate end tag. Is something like this possible?

Example
<html>
<body>
<h3>test</h3>
</body>
</html>

I want the script to return "test"
--
Posted via http://www.ruby-....

14 Answers

Gregory Seidman

3/1/2008 3:35:00 AM

On Sat, Mar 01, 2008 at 12:22:12PM +0900, Tom Arra wrote:
> So I am new to Ruby scripting so I am not sure if this is possible or
> not. I want to make a script that will load a webpage and then search
> through the HTML of that page until it hits a certain tag. Once it hits
> that tag it need to grab all of the text between the tag and the
> appropriate end tag. Is something like this possible?
>
> Example
> <html>
> <body>
> <h3>test</h3>
> </body>
> </html>
>
> I want the script to return "test"

You want the Hpricot gem.

require 'rubygems'
require 'hpricot'

html = <<EOF
<html>
<body>
<h3>test</h3>
</body>
</html>
EOF

doc = Hpricot(html)

puts (doc/'h3').first.inner_text

--Greg

William James

3/1/2008 4:49:00 AM

On Feb 29, 9:22 pm, Tom Arra <turtleman14...@gmail.com> wrote:
> So I am new to Ruby scripting so I am not sure if this is possible or
> not. I want to make a script that will load a webpage and then search
> through the HTML of that page until it hits a certain tag. Once it hits
> that tag it need to grab all of the text between the tag and the
> appropriate end tag. Is something like this possible?
>
> Example
> <html>
> <body>
> <h3>test</h3>
> </body>
> </html>
>
> I want the script to return "test"
> --
> Posted via http://www.ruby-....

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
body[ %r{<title\s*>(.*?)</title\s*>}mi, 1 ]

If the tag can contain attributes, e.g.,
<title foo="bar">:

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
body[ %r{<title(?:\s*|\s+.*?)>(.*?)</title\s*>}mi, 1 ]
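
The same idea can be tried without a network connection by running the pattern against a string. This is only a sketch: the helper name first_tag_text and the sample markup are made up for illustration, and the pattern is a variation on the one above, not a replacement for a real parser.

```ruby
# Hypothetical helper (not from the thread): return the text inside the
# first <tag>...</tag> pair in html, or nil when the tag is absent.
def first_tag_text(html, tag)
  name = Regexp.escape(tag)
  # \b keeps "h3" from matching "h33"; [^>]* skips over any attributes.
  html[%r{<#{name}\b[^>]*>(.*?)</#{name}\s*>}mi, 1]
end

sample = <<EOF
<html>
<body>
<h3 class="x">test</h3>
</body>
</html>
EOF

puts first_tag_text(sample, 'h3')   # prints test
p first_tag_text(sample, 'h4')      # prints nil
```

String#[] with a regexp and a group number returns nil when nothing matches, so the caller can check for a missing tag before printing.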

William James

3/1/2008 4:52:00 AM

On Feb 29, 9:34 pm, Gregory Seidman <gsslist+r...@anthropohedron.net>
wrote:
> On Sat, Mar 01, 2008 at 12:22:12PM +0900, Tom Arra wrote:
> > So I am new to Ruby scripting so I am not sure if this is possible or
> > not. I want to make a script that will load a webpage and then search
> > through the HTML of that page until it hits a certain tag. Once it hits
> > that tag it need to grab all of the text between the tag and the
> > appropriate end tag. Is something like this possible?
>
> > Example
> > <html>
> > <body>
> > <h3>test</h3>
> > </body>
> > </html>
>
> > I want the script to return "test"
>
> You want the Hpricot gem.

No, he doesn't.

Marc Heiler

3/1/2008 6:53:00 AM

> You want the Hpricot gem.

Personally I agree with that, insofar as I think the simplest,
"default" Ruby solution is better than a specialized one. In this case I
think the better solution is Net::HTTP.
--
Posted via http://www.ruby-....

Todd Benson

3/1/2008 7:00:00 AM

On Fri, Feb 29, 2008 at 10:55 PM, William James <w_a_x_man@yahoo.com> wrote:
> On Feb 29, 9:34 pm, Gregory Seidman <gsslist+r...@anthropohedron.net>
> wrote:
>
> > On Sat, Mar 01, 2008 at 12:22:12PM +0900, Tom Arra wrote:
> > > So I am new to Ruby scripting so I am not sure if this is possible or
> > > not. I want to make a script that will load a webpage and then search
> > > through the HTML of that page until it hits a certain tag. Once it hits
> > > that tag it need to grab all of the text between the tag and the
> > > appropriate end tag. Is something like this possible?
> >
> > > Example
> > > <html>
> > > <body>
> > > <h3>test</h3>
> > > </body>
> > > </html>
> >
> > > I want the script to return "test"
> >
> > You want the Hpricot gem.
>
> No, he doesn't.

Same question, different people, same strict requirements. It sounds
a little like homework. In that case, I suppose some of the regexp
solutions provided will work (for this small use case).

I still think Florian said it best, though. Unless you can "stack",
you won't be able to correctly reveal the components inside a nested
language structure. I haven't looked into the theory, but I can
attest to the pain in the arse I've had trying to scrape with regular
expressions.

Todd
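
The nesting problem described above can be seen with a small made-up string. The sketch below assumes a single tag name (div) and well-formed input; balanced_div_text is a hypothetical helper that counts depth, which is a one-tag stand-in for the stack mentioned above.

```ruby
nested = '<div>outer <div>inner</div> tail</div>'

# The non-greedy pattern stops at the first closing tag it meets.
lazy = nested[%r{<div>(.*?)</div>}m, 1]
# The greedy pattern runs on to the last closing tag in the whole string.
greedy = (nested + '<div>other</div>')[%r{<div>(.*)</div>}m, 1]

p lazy     # "outer <div>inner"
p greedy   # "outer <div>inner</div> tail</div><div>other"

# Hypothetical helper: track nesting depth to find the matching close tag.
def balanced_div_text(html)
  depth = 0
  start = nil
  html.scan(%r{</?div\b[^>]*>}i) do
    m = Regexp.last_match
    if m[0].start_with?('</')
      depth -= 1
      # Back at depth zero: this close tag pairs with the first open tag.
      return html[start...m.begin(0)] if depth.zero? && start
    else
      start ||= m.end(0)
      depth += 1
    end
  end
  nil
end

p balanced_div_text(nested)   # "outer <div>inner</div> tail"
```

Neither single pattern returns the contents of the outer element; the depth counter does, but only for this simplified case.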

Tom Arra

3/1/2008 12:44:00 PM

William James wrote:
> On Feb 29, 9:22 pm, Tom Arra <turtleman14...@gmail.com> wrote:
>> </body>
>> </html>
>>
>> I want the script to return "test"
>> --
>> Posted via http://www.ruby-....
>
> require 'net/http'
> puts Net::HTTP.new('www.google.com').get('/').
> body[ %r{<title\s*>(.*?)</title\s*>}mi, 1 ]
>
> If the tag can contain attributes, e.g.,
> <title foo="bar">:
>
> require 'net/http'
> puts Net::HTTP.new('www.google.com').get('/').
> body[ %r{<title(?:\s*|\s+.*?)>(.*?)</title\s*>}mi, 1 ]

So far I think this is closest to what I am looking for. I need to go to
a website that has server information and pull that out of the HTML.
Then take that info and spit it back out to the user. If I am
understanding the code above, it at least does the first part, which I
had no clue how to do.
--
Posted via http://www.ruby-....

Tom Arra

3/1/2008 1:56:00 PM

Tom Arra wrote:
> William James wrote:
>> On Feb 29, 9:22 pm, Tom Arra <turtleman14...@gmail.com> wrote:
>>> </body>
>>> </html>
>>>
>>> I want the script to return "test"
>>> --
>>> Posted via http://www.ruby-....
>>
>> require 'net/http'
>> puts Net::HTTP.new('www.google.com').get('/').
>> body[ %r{<title\s*>(.*?)</title\s*>}mi, 1 ]
>>
>> If the tag can contain attributes, e.g.,
>> <title foo="bar">:
>>
>> require 'net/http'
>> puts Net::HTTP.new('www.google.com').get('/').
>> body[ %r{<title(?:\s*|\s+.*?)>(.*?)</title\s*>}mi, 1 ]
>
> So far I think this is closest to what I am looking for. I need to go to
> a website that has a server information and pull that out of the HTML.
> Then take that info and spit it back out to the user. If I am
> understanding the code above, it at least does the first part which I
> had no clue how to do.

Well I just tried it and it worked like a charm. My next thing is to
limit what it brings back.

Example
<h3>blah blah blah 7.0.0.3.4 blah blah blah</h3>

I want to pull just the 7.0.0.3.4 and none of the words. I am sure this
is going to have to deal with more regular expressions but I never
really understood how to use them well.
--
Posted via http://www.ruby-....

William James

3/1/2008 3:01:00 PM

On Mar 1, 7:56 am, Tom Arra <turtleman14...@gmail.com> wrote:
> Tom Arra wrote:
> > William James wrote:
> >> On Feb 29, 9:22 pm, Tom Arra <turtleman14...@gmail.com> wrote:
> >>> </body>
> >>> </html>
>
> >>> I want the script to return "test"
> >>> --
> >>> Posted via http://www.ruby-....
>
> >> require 'net/http'
> >> puts Net::HTTP.new('www.google.com').get('/').
> >> body[ %r{<title\s*>(.*?)</title\s*>}mi, 1 ]
>
> >> If the tag can contain attributes, e.g.,
> >> <title foo="bar">:
>
> >> require 'net/http'
> >> puts Net::HTTP.new('www.google.com').get('/').
> >> body[ %r{<title(?:\s*|\s+.*?)>(.*?)</title\s*>}mi, 1 ]
>
> > So far I think this is closest to what I am looking for. I need to go to
> > a website that has a server information and pull that out of the HTML.
> > Then take that info and spit it back out to the user. If I am
> > understanding the code above, it at least does the first part which I
> > had no clue how to do.
>
> Well I just tried it and it worked like a charm. My next thing is to
> limit what it brings back.
>
> Example
> <h3>blah blah blah 7.0.0.3.4 blah blah blah</h3>
>
> I want to pull just the 7.0.0.3.4 and none of the words. I am sure this
> is going to have to deal with more regular expressions but I never
> really understood how to use them well.
> --
> Posted via http://www.ruby-....

E:\>irb --prompt xmp
s = " <h3>blah blah 7.0.0.3.4 blah</h3>"
==>" <h3>blah blah 7.0.0.3.4 blah</h3>"
# Find a substring composed of numerals and dots that is
# at least 3 characters long.
s[ /[\d.]{3,}/ ]
==>"7.0.0.3.4"
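
One caveat with [\d.]{3,} is that it also accepts a run of dots or a plain number such as 100. A slightly stricter pattern, sketched below, requires digits separated by dots. The constant name VERSION_RE and the extra sample strings are assumptions for illustration only.

```ruby
s = ' <h3>blah blah 7.0.0.3.4 blah</h3>'

# Digits followed by one or more ".digits" groups: a dotted version number.
VERSION_RE = /\d+(?:\.\d+)+/

p s[VERSION_RE]                          # "7.0.0.3.4"

# The looser pattern latches onto the ellipsis first.
p 'wait... 7.0.0.3.4'[/[\d.]{3,}/]       # "..."
p 'wait... 7.0.0.3.4'[VERSION_RE]        # "7.0.0.3.4"

# String#scan collects every match instead of only the first.
p 'v1.2 and v3.4.5'.scan(VERSION_RE)     # ["1.2", "3.4.5"]
```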

Tom Arra

3/1/2008 6:06:00 PM

William James wrote:
> On Mar 1, 7:56 am, Tom Arra <turtleman14...@gmail.com> wrote:
>> >> require 'net/http'
>> > So far I think this is closest to what I am looking for. I need to go to
>>
>> I want to pull just the 7.0.0.3.4 and none of the words. I am sure this
>> is going to have to deal with more regular expressions but I never
>> really understood how to use them well.
>> --
>> Posted via http://www.ruby-....
>
> E:\>irb --prompt xmp
> s = " <h3>blah blah 7.0.0.3.4 blah</h3>"
> ==>" <h3>blah blah 7.0.0.3.4 blah</h3>"
> # Find a substring composed of numerals and dots that is
> # at least 3 characters long.
> s[ /[\d.]{3,}/ ]
> ==>"7.0.0.3.4"

You're really good at this stuff! One thing I noticed is that it works
perfectly for the regular domain, but as soon as I put a full URL into
the Net::HTTP.new command it starts to throw errors. Any ideas?
--
Posted via http://www.ruby-....

Tom Arra

3/1/2008 6:44:00 PM

Here's what I have so far:

#! /usr/bin/ruby
require 'net/http'

text = Net::HTTP.new('www.tomarra.com').get('/').body[
%r{<title\s*>(.*?)</title\s*>}mi, 1 ]
print "TomArra.com Title Tag: "
print text
print "\n"
s = "<h3>blah blah 7.0.0.4.3 blah blah</h3>"[ /[\d.]{3,}/ ]
print s

puts Net::HTTP.new('www.tomarra.com/credits.html').get('/').body[
%r{<center\s*>(.*?)</center\s*>}mi, 1 ]

and here is my output
TomArra.com Title Tag: Welcome To TomArra.com
7.0.0.4.3
SocketError: getaddrinfo: nodename nor servname provided, or not known

method initialize in http.rb at line 564
method open in http.rb at line 564
method connect in http.rb at line 564
method timeout in timeout.rb at line 48
method timeout in timeout.rb at line 76
method connect in http.rb at line 564
method do_start in http.rb at line 557
method start in http.rb at line 546
method request in http.rb at line 1044
method get in http.rb at line 781
at top level in simple.rb at line 11
Program exited.
--
Posted via http://www.ruby-....
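
The SocketError is consistent with how Net::HTTP.new works: its first argument is a host name only, so 'www.tomarra.com/credits.html' is looked up as if the whole string were a host, while the path belongs in the call to get. A minimal sketch of one way to split a full URL, using the standard uri library, follows. The helper names host_and_path and fetch_body are hypothetical, and query strings are ignored for brevity.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: split a full URL into the host for Net::HTTP.new
# and the path for get, defaulting the path to "/".
def host_and_path(url)
  uri = URI.parse(url)
  path = uri.path.empty? ? '/' : uri.path
  [uri.host, path]
end

# Hypothetical helper: fetch the body of a page given its full URL.
# Defined here but not called, so the sketch runs without network access.
def fetch_body(url)
  host, path = host_and_path(url)
  Net::HTTP.new(host).get(path).body
end

p host_and_path('http://www.tomarra.com/credits.html')
# prints ["www.tomarra.com", "/credits.html"]
p host_and_path('http://www.tomarra.com')
# prints ["www.tomarra.com", "/"]
```

With that in place, the last statement of the script could presumably become fetch_body('http://www.tomarra.com/credits.html') followed by the same <center> pattern. Net::HTTP.get also accepts a URI object directly, which is a shorter route when only the body is needed.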
