Asp Forum - parsing html table cells

lrlebron@gmail.com

11/11/2006 1:28:00 PM

I am trying to parse an html page that has strings that look like this

<tr class="bg2" height="17" valign="middle" align="right"><td
align="left"></td><td>4</td><td>47</td><td>1</td><td>19</td></tr> to
get the numbers inside the table cells.

I would like to end up with a simple string that looks like this (for this
row)
4 47 1 19

The number of table cells in a row that have numbers may vary for
different rows.
I'm new to Ruby, so bear with me. I'm also learning to use hpricot and
have been able to get the table rows using it.

thanks,

Luis

3 Answers

David Vallner

11/11/2006 1:54:00 PM


I'd use XPath, though I'm not sure whether that's doable with hpricot's CSS
selectors or its (I think, admittedly basic) XPath support.

If you know the webpage is valid XHTML, I'd say switch to REXML; if not,
massage it with tidy (maybe hpricot can do this better too) and then switch
to REXML.

The code would probably be something like (where doc is the REXML document):

bg2_strings = doc.elements.to_a(%{//tr[@class='bg2']}).map { |bg2_row|
  bg2_row.elements.to_a('td').map { |cell| cell.text }.join(' ').strip.gsub(/\s+/, ' ')
}

This might be horribly wrong, because I find REXML's XPath API hard to
memorise. YMMV. (It also hates the text() node test with a passion,
whence the second map.)
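
For readers who want to try the REXML route, here is a minimal sketch rather than a definitive implementation. It assumes the input is well-formed XML (REXML will raise on sloppy HTML) and that the rexml library is available, as it is bundled with recent Ruby versions. The sample markup is the row from the question wrapped in a table; the second row and the variable names are made up for illustration.

```ruby
require 'rexml/document'

# Sample input: the row from the question plus a made-up row with another
# class, wrapped in a table. Assumed to be well-formed XML.
html = <<~XML
  <table>
    <tr class="bg2" height="17" valign="middle" align="right"><td align="left"></td><td>4</td><td>47</td><td>1</td><td>19</td></tr>
    <tr class="bg1"><td>ignored</td></tr>
  </table>
XML

doc = REXML::Document.new(html)

# Select every row whose class attribute is exactly "bg2", then collect the
# text of its td children. An empty cell has nil text, so convert with to_s
# and drop the blanks before joining.
bg2_strings = REXML::XPath.match(doc, "//tr[@class='bg2']").map do |row|
  row.get_elements('td').map { |cell| cell.text.to_s.strip }.reject(&:empty?).join(' ')
end

puts bg2_strings
```

Dropping empty cells before the join replaces the strip/gsub cleanup, and it keeps working when a row has a different number of numeric cells.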

David Vallner

lrlebron@gmail.com

11/11/2006 3:23:00 PM

Thanks for your help. I was able to get it with some hpricot code:

intCells = tr.search("td").length

1.upto(intCells-1) do |i|
  print tr.search("td:eq(#{i})").inner_html + ' '
end

thanks,

Luis


Paul Lutus

11/12/2006 12:29:00 AM


Try this:

-------------------------------------

#!/usr/bin/ruby -w

table = "<table><tr>" +
        "<td>4</td><td>47</td><td>1</td><td>19</td></tr>" +
        "<tr><td>7</td><td>49</td><td>4</td><td>39</td></tr>" +
        "<tr><td>14</td><td>17</td><td>19</td><td>21</td>" +
        "</tr></table>"

rows = table.scan(%r{<tr>.*?</tr>})

rows.each do |row|
  fields = row.scan(%r{<td>(.*?)</td>})
  puts fields.join(",")
end

-------------------------------------

Output:

4,47,1,19
7,49,4,39
14,17,19,21
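
The patterns above match only bare <tr> and <td> tags, while the row in the original question carries attributes and starts with an empty cell. The following is a hedged sketch of one way the same regex approach could be adapted, not a robust HTML parser. The helper name cell_values is hypothetical, and the code assumes the cells contain no nested tables.

```ruby
# The row from the original question, attributes included.
row = '<tr class="bg2" height="17" valign="middle" align="right"><td ' \
      'align="left"></td><td>4</td><td>47</td><td>1</td><td>19</td></tr>'

# Hypothetical helper: returns the non-empty cell contents of one row.
# [^>]* lets the opening tag carry any attributes.
def cell_values(row_html)
  row_html.scan(%r{<td\b[^>]*>(.*?)</td>}mi).flatten.map(&:strip).reject(&:empty?)
end

# Match <tr> with or without attributes, across line breaks.
rows = row.scan(%r{<tr\b[^>]*>.*?</tr>}mi)

# One space-separated string per row, as asked for in the question.
lines = rows.map { |r| cell_values(r).join(' ') }
puts lines
```

Rejecting empty strings drops the leading blank cell, so rows with varying numbers of numeric cells come out without stray separators.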

--
Paul Lutus
http://www.ara...

comp.lang.ruby
