Asp Forum - parsing HTML code with regex

Anthony Walsh

10/24/2006 6:35:00 PM

I'm trying to parse through some html code and count the number of times
a match happens. The file is a large table with a ton of <tr> and <tr
'something'>. There are no spaces in the file. I'm trying to count and
print each <tr> and <tr 'something'>.

I haven't even gotten to counting my matches. I'm still working on
matching with <tr> or <tr 'anything'>

I've done:

op_file = HTML_CODE
if op_file =~ /(<tr(.*?)>)+/

but it catches everything from the first <tr to the end of the line.
Any ideas?

-Shinkaku

--
Posted via http://www.ruby-....

5 Answers

Paul Lutus

10/24/2006 6:51:00 PM

Anthony Walsh wrote:

> I'm trying to parse through some html code and count the number of times
> a match happens. The file is a large table with a ton of <tr> and <tr
> 'something'>. There are no spaces in the file. I'm trying to count and
> print each <tr> and <tr 'something'>.
>
> I haven't even gotten to counting my matches. I'm still working on
> matching with <tr> or <tr 'anything'>
>
> I've done:
>
> op_file = HTML_CODE
> if op_file =~ /(<tr(.*?)>)+/

You want:

if op_file =~ /<tr.*?>/

But see below.

>
> but it catches everything from the first <tr to the end of the line.

Also, try scanning for matches, like this:

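#!/usr/bin/ruby -w

path = "path-to-HTML-page"

# Read the whole file into a single string.
data = File.read(path)

# scan returns an array holding every non-overlapping match.
array = data.scan(%r{<tr.*?>})

puts array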
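
Since the original question also asks for a count, here is a minimal sketch that builds on the scan call above. The helper name tr_tags and the sample string are made up for illustration. It swaps .*? for [^>]*, which cannot run past the closing ">" and also copes with a tag that spans a line break, and it adds \b so that a tag such as <track> is not counted.

```ruby
# Hypothetical helper: returns every opening <tr ...> tag found in html.
def tr_tags(html)
  # \b stops <track> and similar names from matching;
  # [^>]* takes everything up to (not including) the closing ">".
  html.scan(/<tr\b[^>]*>/i)
end

# Made-up sample input standing in for the real file contents.
sample = "<table><tr><td>a</td></tr><tr class='odd'><td>b</td></tr></table>"

tags = tr_tags(sample)
puts tags                          # one matched tag per line
puts "#{tags.size} <tr> tags"      # the count is just the array size
```

For a real file, tr_tags(File.read(path)).size should give the number of opening row tags.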

--
Paul Lutus
http://www.ara...

Phlip

10/24/2006 7:02:00 PM

Anthony Walsh wrote:

> I'm trying to parse through some html code and count the number of times
> a match happens.

If the code is not yet XHTML, use Tidy to upgrade it.

Then parse it with XPath, looking for your match.

(Tip: All HTML that you control should be XHTML, of the highest quality.
Don't rely on sloppy HTML and "browser forgiveness"!)
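
A rough sketch of the XPath approach, assuming the markup is already well-formed (for example after a cleanup pass) and carries no namespace declaration. It uses REXML, which ships with standard Ruby installs; the helper name count_rows and the sample markup are made up for illustration, and the cleanup step itself is not shown.

```ruby
require 'rexml/document'

# Hypothetical helper: counts <tr> elements in well-formed markup.
def count_rows(xhtml)
  doc = REXML::Document.new(xhtml)
  # '//tr' selects every tr element anywhere in the document.
  REXML::XPath.match(doc, '//tr').size
end

# Made-up sample input; real input must be well-formed XML to parse.
sample = "<table><tr><td>a</td></tr><tr class='odd'><td>b</td></tr></table>"
puts count_rows(sample)
```

Unlike a regex, this counts elements rather than text that happens to look like a tag, so commented-out rows or "<tr" inside attribute values are not miscounted.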

--
Phlip
http://www.greencheese.u... <-- NOT a blog!!!

Michael Perle

10/24/2006 8:31:00 PM

Anthony Walsh wrote:
> I'm trying to parse through some html code and count the number of times
> a match happens. The file is a large table with a ton of <tr> and <tr
> 'something'>. There are no spaces in the file. I'm trying to count and
> print each <tr> and <tr 'something'>.
>
> I haven't even gotten to counting my matches. I'm still working on
> matching with <tr> or <tr 'anything'>
>
> I've done:
>
> op_file = HTML_CODE
> if op_file =~ /(<tr(.*?)>)+/

You are always matching within a single line only, because
. does not match a newline unless you add /m.
Perhaps you mean a regular expression like

/(<tr([^>]*?)>)+/m

Anyway, I am not sure the if ... is the right
construct. Don't you want the return value
of String#match instead? It gives you a MatchData
object, from which you can get the results as
an array or similar.
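
To illustrate the difference, a small sketch: =~ returns only the offset of the first match (or nil), while String#match returns a MatchData object with the matched text and captures. The helper name first_tr and the sample string are made up for illustration. Note that either form finds only the first tag, so counting every row still needs something like scan.

```ruby
# Hypothetical helper: MatchData for the first opening <tr> tag, or nil.
def first_tr(html)
  html.match(/<tr\b([^>]*)>/)
end

# Made-up sample input.
html = "<table><tr class='x'><td>1</td></tr><tr><td>2</td></tr></table>"

# =~ gives the character offset of the first match, or nil if none.
index = html =~ /<tr\b[^>]*>/
puts index

m = first_tr(html)
if m
  puts m[0]           # the whole matched tag
  puts m[1]           # the captured attribute text (may be empty)
  puts m.begin(0)     # offset where the match starts, same as index
  puts m.to_a.inspect # whole match plus captures as an array
end
```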

MP

dblack

10/25/2006 12:46:00 PM

Anthony Walsh

10/25/2006 3:15:00 PM

> Also, try scanning for matches, like this:
>
> #!/usr/bin/ruby -w
>
> path="path-to-HTML-page"
>
> data = File.read(path)
>
> array = data.scan(%r{<tr.*?>})
>
> puts array

Thanks, this worked.

--
Posted via http://www.ruby-....

comp.lang.ruby
