Bernard Kenik
12/25/2006 7:24:00 AM
Gregory Seidman wrote:
> I missed the original post of this thread, but I recommend that the
> original poster just ask how to do what s/he is trying to do with Hpricot.
> I've been using it with good success for a while, and I've mostly learned it
> from playing around in irb. I am willing to help if there is a specific
> question.
>
> Ideally someone (maybe even me) will get around to writing comprehensive
> documentation for Hpricot. Until then, well, there are a few of us who have
> used it enough to answer questions helpfully, including _why himself. He's
> been known to take offhand suggestions to heart and implement them.
>
Thank you for your offer of assistance.
In simple terms, I need to extract text from a <p ..... /p> element.
Yes, I know that I can use "traverse_text". That works fine except when
there is no text at an expected location.
Here is an example of what I mean. I am afraid that the <p ... /p> is
rather long and that it will get wrapped in the wrong places.
<p class="MsoNormal"><span
style="font-size:11.0pt;font-family:Helvetica">W</span><span
style="font-size:9.0pt;font-family:Helvetica">EST</span><span
style="font-size:11.0pt;font-family:Helvetica"><br />
<img src="Bermuda%20Bowl%20Final_files/s.gif" border="0"
id="_x0000_i1505" height="11" width="13" />8 7 6 2 <br />
<img src="Bermuda%20Bowl%20Final_files/h.gif" border="0"
id="_x0000_i1506" height="11" width="13" /><br />
<img src="Bermuda%20Bowl%20Final_files/d.gif" border="0"
id="_x0000_i1507" height="11" width="13" />K 7 6 4 <br />
<img src="Bermuda%20Bowl%20Final_files/c.gif" border="0"
id="_x0000_i1508" height="11" width="13" />Q 7 6 3 2
<o:p></o:p></span></p>
Notice that after each <img ... /> element there is some text, except
after the element with the h.gif file. Now it is perfectly all right to
have no text. The problem is that I need to know which img element is
not followed by text, since it could be in any of the four locations.
In fact, there could be more than one occurrence of no text.
What I would like to do is insert the basename of the src filename,
followed by a space, after each img element.
Thus the text strings would read "s 8 7 6 2", "h ", "d K 7 6 4",
"c Q 7 6 3 2"
as opposed to "8 7 6 2", "K 7 6 4", "Q 7 6 3 2".
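
To pin down the desired result, here is a minimal plain-Ruby sketch that
uses only regexps and the standard library. It is not hpricot code and
says nothing about hpricot's API; it only shows the target strings. The
helper name label_suits and the constant HTML are hypothetical, and the
markup is a shortened copy of the example paragraph above.

```ruby
# Plain-Ruby baseline (standard library only, no hpricot).
# HTML is an abbreviated copy of the <p> element from the post.
HTML = <<~EOS
  <p class="MsoNormal"><span style="font-size:11.0pt;font-family:Helvetica"><br />
  <img src="Bermuda%20Bowl%20Final_files/s.gif" border="0" />8 7 6 2 <br />
  <img src="Bermuda%20Bowl%20Final_files/h.gif" border="0" /><br />
  <img src="Bermuda%20Bowl%20Final_files/d.gif" border="0" />K 7 6 4 <br />
  <img src="Bermuda%20Bowl%20Final_files/c.gif" border="0" />Q 7 6 3 2
  <o:p></o:p></span></p>
EOS

# label_suits is a hypothetical helper name, not part of any library.
def label_suits(html)
  # For each <img>, capture its src attribute and the text that follows
  # it up to the next tag (possibly an empty string).
  html.scan(/<img\b[^>]*\bsrc="([^"]*)"[^>]*>([^<]*)/).map do |src, text|
    suit = File.basename(src, ".*") # "…/s.gif" becomes "s"
    "#{suit} #{text.strip}"
  end
end

p label_suits(HTML)
# prints ["s 8 7 6 2", "h ", "d K 7 6 4", "c Q 7 6 3 2"]
```

The empty holding after h.gif survives as "h ", so the position of the
missing text is never lost.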
An alternate way would be to simply insert arbitrary text ("x ") after
each img element. There is no need to extract the basename, since the
s h d c sequence is constant.
This would let traverse_text yield "x 8 7 6 2", "x ", "x K 7 6 4",
"x Q 7 6 3 2".
A single character (x) would also work.
Actually, I prefer the alternate way... Occam's Razor.
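
The marker variant can be sketched the same way in plain Ruby, again
with regexps only and without hpricot. The names SAMPLE, mark_images and
text_chunks are hypothetical, the src paths are shortened for
illustration, and text_chunks is only a rough stand-in for walking the
text nodes; it is not how traverse_text is implemented.

```ruby
# Plain-Ruby sketch of the "insert a marker after each img" idea.
# SAMPLE uses shortened, illustrative src paths.
SAMPLE = <<~EOS
  <img src="s.gif" border="0" />8 7 6 2 <br />
  <img src="h.gif" border="0" /><br />
  <img src="d.gif" border="0" />K 7 6 4 <br />
  <img src="c.gif" border="0" />Q 7 6 3 2
EOS

# Append the marker text directly after every <img ...> tag.
def mark_images(html, marker = "x ")
  html.gsub(/<img\b[^>]*>/) { |tag| tag + marker }
end

# Split on tags, then drop chunks that are only whitespace.
# Note that strip also removes the trailing space, so "x " becomes "x".
def text_chunks(html)
  html.split(/<[^>]*>/).map(&:strip).reject(&:empty?)
end

p text_chunks(mark_images(SAMPLE))
# prints ["x 8 7 6 2", "x", "x K 7 6 4", "x Q 7 6 3 2"]
```

Because every img now contributes a chunk, the second entry ("x" on its
own) marks the location that has no text.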
I know that I can do both using regexps, but I am trying to learn and
understand hpricot.
I understand that hpricot can modify (edit), remove, and add elements.
It is definitely stronger and faster. It also offers a more generic way
of processing html files.
So in simple terms, the problem is how to insert text after an element.