Asp Forum - html parsing using regular expressions

Anthony Walsh

10/25/2006 2:57:00 AM

I'm new to Ruby and trying to use regular expressions to parse an html
file. The page is a large table with no spaces in the html code. I want
to count the number of times <tr> or <tr 'anything'> occurs. I'm stuck
on trying to match every variety of <tr>

I've tried

op_file = File.read(htmlfile)
if op_file =~ /(<tr(.*?)>)+/

but it catches the first <tr and matches all the way to the end of the
file. Anyone have any advice on matching and counting?

-Shinkaku

--
Posted via http://www.ruby-....

2 Answers

Austin Ziegler

10/25/2006 5:02:00 AM

On 10/24/06, Anthony Walsh <akakuda@excite.com> wrote:
> I'm new to Ruby and trying to use regular expressions to parse an html
> file.

Don't. Use Hpricot instead. Your brain will thank you for it.

I haven't used Hpricot, but I've heard great things about it; I've
tried to do HTML parsing with regexen, and it's a mook's game.

-austin
--
Austin Ziegler * halostatue@gmail.com * http://www.halo...
* austin@halostatue.ca * http://www.halo...feed/
* austin@zieglers.ca

Paul Lutus

10/25/2006 5:20:00 AM

Anthony Walsh wrote:

> I'm new to Ruby and trying to use regular expressions to parse an html
> file. The page is a large table with no spaces in the html code. I want
> to count the number of times <tr> or <tr 'anything'> occurs. I'm stuck
> on trying to match every variety of <tr>
>
> I've tried
>
> op_file = File.read(htmlfile)
> if op_file =~ /(<tr(.*?)>)+/
>
> but it catches the first <tr and matches all the way to the end of the
> file. Anyone have any advice on matching and counting?

You need to tell us whether you have read the replies you received to this
same question when you asked it eight hours ago. I answered your question,
as did several others, but you have not given any indication that you saw
the replies.

Here is one answer:

#!/usr/bin/ruby -w

path = "path-to-HTML-page"
data = File.read(path)

array = data.scan(%r{<tr.*?>})

puts array.size # gives a count of occurrences
puts array      # shows the matches
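
One caveat about the pattern above: <tr.*?> also matches any other tag whose name starts with "tr" (such as <track>), and because . does not cross a line break by default, it can misread a tag whose attributes wrap onto a second line. A slightly stricter variant might look like the sketch below. The method name tr_open_tags and the sample string are made up for illustration and are not from the original post.

```ruby
# Hypothetical helper (not from the original post): returns every opening
# <tr> tag found in the given HTML string.
def tr_open_tags(html)
  # \b     stops the match from accepting longer tag names such as <track>
  # [^>]*  consumes attributes, including ones that wrap across lines
  # /i     accepts <TR> as well as <tr>
  html.scan(/<tr\b[^>]*>/i)
end

# Made-up sample input standing in for File.read(path).
sample = "<table><TR><td>a</td></tr>" \
         "<tr class='odd'\n id='r2'><td>b</td></tr>" \
         "<track src='x'></table>"

rows = tr_open_tags(sample)
puts rows.size # count of opening <tr> tags
puts rows      # the matched tags themselves
```

This still counts a <tr> that appears inside a comment or a script block, which is one reason a real HTML parser such as Hpricot, suggested earlier in this thread, is the sturdier choice.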

Please read replies before posting again.

--
Paul Lutus
http://www.ara...

comp.lang.ruby
