Asp Forum - Regexp Ruby selection

touffik@gmail.com

7/25/2008 7:41:00 AM

Hi folks,
I'm trying to write a Ruby script that selects the content of an HTML
table in an HTML page.
I used Rubular to test my regexp, which is
/ <td class="TabIntCenContenuto"[^>]*>(.*)  /
With Rubular, the result of my expression is:
Result 1
1. 12345678
Result 2
1. SAN FRANCESCO DA PAOLA
Result 3
1. Via San Francesco Da Paola, 10
Result 4
1. 10123
Result 5
1. TORINO
etc....
But with my script:

File.open('D:/testt/1.txt', 'r') do |filein|
  while line = filein.gets
    p line if line =~ /<td class="TabIntCenContenuto"[^>]*>/ .. line =~ /\/A /
  end
  fileout.puts p
end
end

I got this result:
"</td><td class=\"TabIntCenContenuto\">12345678 \n"
"</td><td class=\"TabIntCenContenuto\">SAN FRANCESCO DA PAOLA </td>\n"
"<td class=\"TabIntCenContenuto\">Via San Francesco Da Paola, 10 </td>\n"
"<td class=\"TabIntCenContenuto\">10123 </td>\n"
"<td class=\"TabIntCenContenuto\" align=\"left\">TORINO </td>\n"

I thought the .. between the two "line =~" was like the (...) in Rubular,
which captures the content?
Moreover, I would like to transform this HTML into XML, but I can't
think of how to turn these HTML lines into XML such as:
<root>
<number>12345678</number>
But there is no 'name' attribute or anything similar in the <td>, so
doing a match/replace would be difficult?
...

So if someone can help me, I would be very grateful.
Have a nice day ;)

5 Answers

Srijayanth Sridhar

7/25/2008 7:44:00 AM

Any specific reason you can't use hpricot or other HTML parsers?

Jayanth


Shadowfirebird

7/25/2008 7:50:00 AM

I'll second that. Hpricot is really quite remarkable. It'll almost
certainly save you days and days of pain. Unless you are doing this
for fun / learning, of course.


--
Me, I imagine places that I have never seen / The colored lights in
fountains, blue and green / And I imagine places that I will never go
/ Behind these clouds that hang here dark and low
But it's there when I'm holding you / There when I'm sleeping too /
There when there's nothing left of me / Hanging out behind the
burned-out factories / Out of reach but leading me / Into the
beautiful sea

Sebastian Hungerecker

7/25/2008 7:59:00 AM

touffik@gmail.com wrote:
> I thought the .. between 2 "line =~" was like (...) in rubular which
> let catch the content ??

Generally in Ruby, .. denotes a range, like starting_value .. end_value.
In this case, though, it denotes a flip-flop, which is evil and should never
ever be used because it makes my head hurt. Here's what it does, though:
some_loop {
  do_something if foo .. bar
}
This does nothing until foo is true. When foo is true, it will do_something.
It then keeps doing so on every iteration of the loop up to and including
the one where bar becomes true. After that it stops until foo is true again.
So, in summary: it doesn't do what you thought it did. As a matter of fact, it
doesn't do anything sane. So just keep as far away from it as possible.
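
To make the behaviour described above concrete, here is a minimal sketch of a flip-flop selecting whole lines between two markers. The method name lines_between and the sample data are made up for illustration; they are not from the original script. Note that the flip-flop only decides which lines are selected. It never captures part of a line, which is why the original script printed the complete <td> lines.

```ruby
# Collect every line from the first one matching start_re through the
# next one matching stop_re, inclusive. (Hypothetical helper.)
def lines_between(lines, start_re, stop_re)
  picked = []
  lines.each do |line|
    # Flip-flop: false until start_re matches, then true on each
    # iteration up to and including the one where stop_re matches.
    picked << line if (line =~ start_re)..(line =~ stop_re)
  end
  picked
end

sample = ["header", "<table>", "<td>1</td>", "</table>", "footer"]
p lines_between(sample, /<table>/, /<\/table>/)
```

To pull out only the cell text, a capture group in the regexp (read back via $1 or a MatchData object) is the usual tool, not the .. operator.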

HTH,
Sebastian
--
Jabber: sepp2k@jabber.org
ICQ: 205544826

touffik@gmail.com

7/25/2008 9:15:00 AM

On Fri, Jul 25, 2008 at 8:44 AM, Srijayanth Sridhar
<srijaya...@gmail.com> wrote:
> Any specific reason you can't use hpricot or other HTML parsers?

I didn't know this tool for ruby. I used once a parser named tidy but
that's all. I'll try now and let you know.

On Jul 25, 9:58 am, Sebastian Hungerecker <sep...@googlemail.com>
wrote:
> Generally in ruby .. denotes a range. Like starting_value .. end_value.
> In this case though it denotes a flip flop, which is evil and should never
> ever be used because it makes my head hurt. Here's what it does though:
> some_loop {
> do_something if foo .. bar
> }
>
> This will do nothing until foo is true. When foo is true it will do_something.
> It will then keep doing_something in every iteration of the loop until bar
> becomes true. After bar became true it will stop doing_something until foo is
> true again.
> So as a summary: It doesn't do what you thought it did. As a matter of fact it
> doesn't do anything sane. So just keep as far away from it as possible.
>

So I was wrong. Thank you for explaining my incorrect use of .. in
the loop.

Thank you.

Peña, Botp

7/25/2008 9:46:00 AM

From: touffik@gmail.com [mailto:touffik@gmail.com]
# I'm trying to code a ruby script that select the content of a HTML
# table in a HTML page.
# I used rubular to test my regexp syntax which is
# / <td class="TabIntCenContenuto"[^>]*>(.*)  /

the re is fine, you can use that

# with rubular the result of my expression is :
# Result 1
# 1. 12345678
# Result 2
# 1. SAN FRANCESCO DA PAOLA
# Result 3
# 1. Via San Francesco Da Paola, 10
# Result 4
# 1. 10123
# Result 5
# 1. TORINO
# etc....
# But with my script :
#
# File.open('D:/testt/1.txt', 'r') do |filein|
# while line = filein.gets
# p line if line =~ /<td class="TabIntCenContenuto"[^>]*>/ .. line
# =~ /\/A /
# end
# fileout.puts p
# end
# end
# I got this result
# "</td><td class=\"TabIntCenContenuto\">12345678 \n"
# "</td><td class=\"TabIntCenContenuto\">SAN FRANCESCO DA PAOLA </
# td>\n"
# "<td class=\"TabIntCenContenuto\">Via San Francesco Da Paola,
# 10 </td>\n"
# "<td class=\"TabIntCenContenuto\">10123 </td>\n"
# "<td class=\"TabIntCenContenuto\" align=\"left\">TORINO </td>\n"

you already got it, but you did not capture

sample code & run,

botp@botp-desktop:~$ cat test.rb
File.open('test.txt') do |f|
  while line = f.gets
    if line =~ /<td class="TabIntCenContenuto"[^>]*>(.*) /
      p $1
    end
  end
end

botp@botp-desktop:~$ ruby test.rb
"12345678"
"SAN FRANCESCO DA PAOLA"
"Via San Francesco Da Paola,10"
"10123"
"TORINO"

# I thought the .. between 2 "line =~" was like (...) in rubular which
# let catch the content ??

you are making it harder. keep it simple.

# Moreover I would like to transform this html code in XML. But I can"t
# find an idea how to transform these HTML line in XML.
# <root>
# <number>12345678</number>
# But there is no attribut 'name' or wathever in the <td> so making and
# match/replace would be difficult ?

if the html is nicely formatted, you can loop through the table.
if you want to be sure, try outputting all the data you can capture
first. Then output that again with xml tags inserted.

do not worry. xml, like html, is just text w tags. Manipulating text is
a good learning exercise for ruby.
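
As a rough sketch of the "capture first, then output again with xml tags" idea: since the <td> cells carry no name attribute, one option is to pair the captured values, in order, with a list of tag names you supply yourself. Everything here other than the regexp and the <root>/<number> tags is an assumption for illustration: the method names cells_to_xml and xml_escape, the tag name "city", and the sample lines.

```ruby
# Same pattern as in test.rb above; group 1 is the cell text.
CELL = /<td class="TabIntCenContenuto"[^>]*>(.*) /

XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Escape the characters that are not allowed in XML text content.
def xml_escape(text)
  text.gsub(/[&<>]/, XML_ESCAPES)
end

# Pair each captured cell value, in order, with a caller-supplied tag name.
# (Hypothetical helper; the tag list is an assumption about the columns.)
def cells_to_xml(lines, tags)
  values = lines.map { |line| line[CELL, 1] }.compact
  body = tags.zip(values).map do |tag, value|
    "  <#{tag}>#{xml_escape(value.to_s.strip)}</#{tag}>"
  end
  "<root>\n#{body.join("\n")}\n</root>"
end

sample = [
  "</td><td class=\"TabIntCenContenuto\">12345678 \n",
  "<td class=\"TabIntCenContenuto\" align=\"left\">TORINO </td>\n"
]
puts cells_to_xml(sample, %w[number city])
```

This relies on the cells always appearing in the same order, one per line, which may not hold for every page. An HTML parser is the sturdier route if the markup varies.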

kind regards -botp

comp.lang.ruby
