Asp Forum - [Newbie] Getting data from html-ish like crap.

sidhellfire

3/1/2006 10:20:00 AM

Hi,
I wanted to learn something and chose Ruby,
since it looked awesome, and I can't say it isn't.
I am not an experienced programmer (I tried some
Pascal, then PHP), and decided to do something
small but useful. Let's get straight to the
problem.

At the URL:
https://www.knightonlineworld.com/index.php?pg=rankings&sub=2&radServer=1&cl...
I've got something similar to HTML (at least
it's not table based, but it's still damn ugly code)
with online statistics.

I don't want to parse the whole thing; I just want to
'crop that crap' and retrieve information from
the data inside one element, a div with
id="bleet". There are tables in there, but it seems
impossible to navigate them.

I really suck at strings :(

<tr bgcolor="#FFFFFF">
<td align="center" >Quatrina</td> # I want this
<td align="center" >52</td> #
<td align="center" >Shaman</td> #
<td align="center" >563</td> # and this one
</tr> # preferably everything :P

I would like to put that data in a hash, to have it
useful in the future (I'm aiming for a Rails app
eventually), but selecting two columns in each row,
located in the last table in the id'ed element
of a badly made document, looks
impossible to me. I don't even know where to
start, and it's FAR away from the things I
wanted to do (counting numbers*, assigning
additional data).

All I've done so far is fetch the document:

> require 'uri'
> require 'net/http'
>
> trg = "https://www.knightonlineworld.com/index.php?pg=rankings&sub=2&radServer=1&cl..."
> puts 'processing ' + trg + " :\n"
>
> r = Net::HTTP.get_response(URI.parse(trg).host, URI.parse(trg).path)
>
> puts r.body

Thanks for reading.

*/Notice it does count "Loyality" wrong/

4 Answers

William James

3/1/2006 11:15:00 AM

spam_monkey wrote:
> > require 'uri'
> > require 'net/http'
> >
> > trg = "https://www.knightonlineworld.com/index.php?pg=rankings&sub=2&radServer=1&cl..."
> > puts 'processing ' + trg + " :\n"
> >
> > r = Net::HTTP.get_response(URI.parse(trg).host, URI.parse(trg).path)
> >
> > puts r.body

r.body does not contain "Quatrina".
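
One likely reason, assuming the code exactly as posted: URI#path returns only the path, so the query string (pg=rankings and the rest) never reaches the server, and Net::HTTP.get_response(host, path) talks plain HTTP on port 80 rather than HTTPS. A minimal sketch of the difference, using a hypothetical example.com URL because the real one is truncated in the post:

```ruby
require 'uri'

# Hypothetical stand-in URL; the real one is truncated in the original post.
uri = URI.parse('https://example.com/index.php?pg=rankings&sub=2')

path_only  = uri.path         # what the posted code passes to get_response
with_query = uri.request_uri  # path plus query string

puts path_only
puts with_query
```

Passing uri.request_uri instead of uri.path keeps the query string in the request.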

jgbailey

3/1/2006 4:43:00 PM

First, don't use Net::HTTP. Require 'open-uri' at the top and you can
simplify your code a lot:

html = open("https://www.knighto....com/index.php?pg=rankings&sub=2&radServer=1&clanid=12199") do |page|
  page.read
end

Which will get the whole document into the 'html' variable.

Next, look at StringScanner and regular expressions. Together they let you
iterate through your document quickly and pick out the columns you want.
Some pseudo-code might look like:

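require 'strscan'

scanner = StringScanner.new(html)
# Tags may carry attributes, so match <tr ...> and <td ...>; .*? keeps each cell match short.
row = /<tr[^>]*>\s*<td[^>]*>(.*?)<\/td>\s*<td[^>]*>.*?<\/td>\s*<td[^>]*>.*?<\/td>\s*<td[^>]*>(.*?)<\/td>\s*<\/tr>/m
while scanner.scan_until(row)
  name = scanner[1]
  points = scanner[2]
end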

That will extract the name and points from each row. The parentheses in the
regular expression are "capture groups", and they relate to the assignments
in the loop (name = scanner[1], points = scanner[2]). The 'm' following the
regular expression makes '.' match newlines as well, which is
probably necessary as the table cells are on different lines.
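
To make the capture groups concrete, here is a minimal sketch that runs StringScanner over the sample row from the original question and collects it into a hash. The column meanings (level, class, points) and the hash layout are assumptions for illustration, and the pattern allows attributes on the tags, which the real markup has.

```ruby
require 'strscan'

# Sample row copied from the question, without the "#" annotations.
html = <<~HTML
  <tr bgcolor="#FFFFFF">
  <td align="center" >Quatrina</td>
  <td align="center" >52</td>
  <td align="center" >Shaman</td>
  <td align="center" >563</td>
  </tr>
HTML

# One capture group per cell; /m lets "." cross line breaks inside a cell.
row = /<tr[^>]*>\s*<td[^>]*>(.*?)<\/td>\s*<td[^>]*>(.*?)<\/td>\s*<td[^>]*>(.*?)<\/td>\s*<td[^>]*>(.*?)<\/td>\s*<\/tr>/m

players = {}
scanner = StringScanner.new(html)
# scan_until advances past each matching row, so the loop terminates.
while scanner.scan_until(row)
  players[scanner[1]] = {
    level:  scanner[2].to_i,  # assumed meaning of column 2
    class:  scanner[3],       # assumed meaning of column 3
    points: scanner[4].to_i   # assumed meaning of column 4
  }
end

p players
```

Keying the hash by name makes later lookups such as players["Quatrina"][:points] straightforward; a regex like this is brittle against markup changes, so treat it as a starting point only.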

For further syntax and library help check http://www.ru....
Especially check out the 'Programming Ruby' book and read up on Ruby's regular
expressions, if you aren't familiar with them.

Hope that helps!

Charlie Bowman

3/1/2006 5:21:00 PM

Here's some sample code you might enjoy. It's a random Chuck Norris
joke generator that pulls the jokes off a website.

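require 'net/http'
require 'uri'

# The person parameter is truncated in the archived post.
page = 'http://www.4q.cc/index.php?pid=fact&person=...'
res = Net::HTTP.get(URI.parse(page))
# Capture whatever sits between the closing </h1> and the <hr /> on the same line.
fact = res[/<\/h1>(.*)<hr \/>/, 1]
puts(fact || 'No fact was found!')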

Charlie Bowman
http://www.recentr...

greg.rb

3/1/2006 5:30:00 PM

I tried:

require "open-uri"
trg = open("https://www.knightonlineworld.com/index.php?pg=rankings&sub=2&radServer=1&clanid=1...")

trg.each do |line|
  puts line
end

Result:
c:/ruby/lib/ruby/1.8/open-uri.rb:583:in `proxy_open': open-uri doesn't support https. (ArgumentError)
        from c:/ruby/lib/ruby/1.8/open-uri.rb:525:in `direct_open'
        from c:/ruby/lib/ruby/1.8/open-uri.rb:169:in `open_loop'
        from c:/ruby/lib/ruby/1.8/open-uri.rb:164:in `catch'
        from c:/ruby/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
        from c:/ruby/lib/ruby/1.8/open-uri.rb:134:in `open_uri'
        from c:/ruby/lib/ruby/1.8/open-uri.rb:424:in `open'
        from c:/ruby/lib/ruby/1.8/open-uri.rb:85:in `open'
        from Knight4.rb:2

It doesn't look like it liked https.
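
If open-uri refuses https on a given Ruby version, one possible workaround is to drive Net::HTTP directly with SSL switched on. This is a minimal sketch, assuming a Ruby build with OpenSSL available; build_client and fetch_body are hypothetical helper names, not something from the thread, and certificate verification is left at the library default.

```ruby
require 'net/http'
require 'openssl'
require 'uri'

# Hypothetical helper: build a client for the URI, with SSL on for https.
def build_client(uri)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http
end

# Hypothetical helper: fetch the page body, keeping the query string.
def fetch_body(url)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  build_client(uri).request(request).body
end
```

Calling fetch_body with the full rankings URL would then return the page as a string, ready for the scanning step.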

comp.lang.ruby
