Asp Forum - Extracting Data from a Webpage

Tj Superfly

1/27/2008 3:22:00 AM

Hello everyone.

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

I give it the URL, say www.google.com. It then gives me "Google" - its
title.

Then also, is there any way that the program could extract the next 5
characters - after a certain phrase that doesn't change on the webpage?

Thanks for your help in advance!
--
Posted via http://www.ruby-....

18 Answers

Steve Ross

1/27/2008 4:08:00 AM

On Jan 26, 2008, at 7:21 PM, Tj Superfly wrote:

> Hello everyone.
>
> I was wondering if anyone knew a way to extract the web page title off
> of a specific URL that you input into a program?
>
> I give it the URL, say www.google.com. It then gives me "Google" -
> it's
> title.
>
> Then also, is there anyway that the program could extract the next 5
> characters - after a certain phrase that doesn't change on the
> webpage?
>
> Thanks for your help in advance!
> --
> Posted via http://www.ruby-....
>

http://code.whytheluckystiff.ne...

It's a snap.

7stud --

1/27/2008 4:42:00 AM

Tj Superfly wrote:
> Hello everyone.
>
> I was wondering if anyone knew a way to extract the web page title off
> of a specific URL that you input into a program?
>
> I give it the URL, say www.google.com. It then gives me "Google" - it's
> title.
>
> Then also, is there anyway that the program could extract the next 5
> characters - after a certain phrase that doesn't change on the webpage?
>
> Thanks for your help in advance!

You can do something like this:

require 'open-uri'

url = "http://www.google...

open(url) do |f|
  f.each do |line|
    if md_obj = /<title>(.*)<\/title>/.match(line)
      puts md_obj[1]
    end

    if md_obj = /type=(.{6})/.match(line)
      puts md_obj[1]
    end
  end
end

Ruby also has various html parsing libraries that allow you to search
html documents by tag name, tag position, etc.

7stud --

1/27/2008 5:21:00 AM

7stud -- wrote:
> You can do something like this:
>
> require 'open-uri'
>
> url = "http://www.google...
>
> open(url) do |f|
> f.each do |line|
> if md_obj = /<title>(.*)<\/title>/.match(line)
> puts md_obj[1]
> end
>
> if md_obj = /type=(.{6})/.match(line)
> puts md_obj[1]
> end
>
> end
> end
>

This should be more efficient:

require 'open-uri'

url = "http://www.google...
title_re = Regexp.new(/<title>(.*)<\/title>/)
text_re = Regexp.new(/type=(.{5})/)

open(url) do |f|
  f.each do |line|
    if md_obj = title_re.match(line)
      puts md_obj[1]
    end

    if md_obj = text_re.match(line)
      puts md_obj[1]
      break
    end
  end
end

--output:
Google
hidde #first 5 chars of 'hidden'
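
A minimal sketch of the same idea applied to the whole page as one string instead of line by line, so a title that is split across lines is still found. The helper names page_title and chars_after are hypothetical and not from this thread, and the sample HTML is made up for illustration. Fetching is left out on purpose; on current Ruby versions with open-uri that step would be URI.open(url).read, since plain open(url) stopped handling URLs in Ruby 3.0.

```ruby
# Hypothetical helpers (not from the thread); both work on an HTML string.

# Returns the text inside the first <title> element, or nil if none is found.
# The /m flag lets "." match newlines; /i also accepts <TITLE>.
def page_title(html)
  md = %r{<title[^>]*>(.*?)</title>}mi.match(html)
  md && md[1].strip
end

# Returns up to `count` characters that follow the first occurrence of
# `phrase`, or nil if the phrase does not appear at all.
def chars_after(html, phrase, count = 5)
  md = /#{Regexp.escape(phrase)}(.{0,#{count}})/m.match(html)
  md && md[1]
end

# Made-up sample page; the title is deliberately split across two lines.
sample = <<~HTML
  <html><head><title>
  Google</title></head>
  <body><input type=hidden name=q></body></html>
HTML

puts page_title(sample)            # Google
puts chars_after(sample, "type=")  # hidde
```

Regexp.escape keeps characters such as "." or "?" in the phrase from being treated as regex syntax. A regex is still a rough tool for HTML; it is only reasonable here because the task is this small.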


William James

1/27/2008 7:49:00 AM

On Jan 26, 9:21 pm, Tj Superfly <nonstickg...@verizon.net> wrote:
> Hello everyone.
>
> I was wondering if anyone knew a way to extract the web page title off
> of a specific URL that you input into a program?
>
> I give it the URL, say www.google.com. It then gives me "Google" - its
> title.

"www.google.com"[/(www\.)?(.*)\./,2].capitalize
==>"Google"
"google.com"[/(www\.)?(.*)\./,2].capitalize
==>"Google"

Tj Superfly

1/27/2008 4:35:00 PM

> This should be more efficient:
>
> require 'open-uri'
>
> url = "http://www.google...
> title_re = Regexp.new(/<title>(.*)<\/title>/)
> text_re = Regexp.new(/type=(.{5})/)
>
> open(url) do |f|
> f.each do |line|
> if md_obj = title_re.match(line)
> puts md_obj[1]
> end
>
> if md_obj = text_re.match(line)
> puts md_obj[1]
> break
> end
>
> end
> end
>
> --output:
> Google
> hidde #first 5 chars of 'hidden'

I receive this error message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions? I did try the other clip of code posted here, but got
more errors than this one. =/ I'm reading up on that link posted in the
2nd post to see if I can figure any of this out.

Thanks.


7stud --

1/27/2008 8:25:00 PM

Tj Superfly wrote:
> I receive this eror message when trying this code.
>
> DENTIFIER, expecting $end
> endndndreakmd_obj[1]_re.match(line))/title>/)
>
> Any suggestions?
>

1) Learn some basic ruby?

2) Learn how to post a question on a computer programming forum?

Tj Superfly

1/27/2008 9:57:00 PM

7stud -- wrote:
> Tj Superfly wrote:
>> I receive this eror message when trying this code.
>>
>> DENTIFIER, expecting $end
>> endndndreakmd_obj[1]_re.match(line))/title>/)
>>
>> Any suggestions?
>>
>
> 1) Learn some basic ruby?
>
> 2) Learn how to post a question on a computer programming forum?

Anyone else know what the matter is?

7stud --

1/27/2008 10:26:00 PM

Tj Superfly wrote:
> 7stud -- wrote:
>> Tj Superfly wrote:
>>> I receive this eror message when trying this code.
>>>
>>> DENTIFIER, expecting $end
>>> endndndreakmd_obj[1]_re.match(line))/title>/)
>>>
>>> Any suggestions?
>>>
>>
>> 1) Learn some basic ruby?
>>
>> 2) Learn how to post a question on a computer programming forum?
>
> Anyone else know what the matter is?

How to post a question on a computer programming forum:

1) Post a simple example program that demonstrates your problem.

2) Post the error message in its entirety--not an unintelligible portion
of it.

3) Post your question about the code.

4) Use a descriptive title for your post-- not something like
"URGENT...HELP ME!"

5) Proofread and spell-check your post before clicking submit.

fedzor

1/27/2008 10:49:00 PM

On Jan 27, 2008, at 4:57 PM, Tj Superfly wrote:

> 7stud -- wrote:
>> Tj Superfly wrote:
>>> I receive this eror message when trying this code.
>>>
>>> DENTIFIER, expecting $end
>>> endndndreakmd_obj[1]_re.match(line))/title>/)
>>>
>>> Any suggestions?

I believe that $end means you're missing some sort of end delimiter,
but NOT 'end'. Check for {} or / / for regexp

Also, if you can, have your editor do an autoformat thing so you can
see where the indentation screws up.
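
One way to narrow down a parse error like this is to let Ruby check the source without running it. The sketch below assumes MRI, because RubyVM::InstructionSequence is specific to that implementation; the helper name syntax_error_in is hypothetical and not from this thread. From the command line, ruby -c yourfile.rb performs the same kind of check.

```ruby
# Hypothetical helper (not from the thread): returns nil when `source`
# parses cleanly, otherwise the parser's error message as a String.
# SyntaxError is not a StandardError, so it must be rescued by name.
def syntax_error_in(source)
  RubyVM::InstructionSequence.compile(source)
  nil
rescue SyntaxError => e
  e.message
end

balanced  = "[1, 2].each do |n|\n  puts n\nend\n"
extra_end = balanced + "end\n"   # one `end` too many

p syntax_error_in(balanced)      # nil
puts syntax_error_in(extra_end)  # message text varies by Ruby version
```

Pasting the script in piece by piece (first the loop skeleton, then each if block) shows which addition first makes the check fail.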

Marc Heiler

1/27/2008 11:17:00 PM

> http://code.whytheluckystiff.ne...
> It's a snap.

I believe hpricot, as fine as it may be, is a little bit overkill for
such a task.

Ideally, a simple task should remain simple, or at least as simple as
possible.

comp.lang.ruby
