Chris Carter
12/18/2006 1:01:00 PM
Henry, there was someone just a few days ago who had a problem using
Hpricot with IMG elements in P tags. Paul must have gotten the two of
you confused.
On 12/18/06, Henry Maddocks <henryj@paradise.net.nz> wrote:
>
> On 17/12/2006, at 11:15 PM, Paul Lutus wrote:
>
> > Henry Maddocks wrote:
> >
> >> Sorry, try again...
> >>
> >> Not sure where to send this, sorry if it's not the right place...
> >>
> >> The html in the attached file renders 'correctly' in the 3 browsers I
> >> have tried but it tricks hpricot because of the second malformed
> >> comment. When I say correctly I mean I get to see 'Some text'. I
> >> guess it could be argued that this is incorrect. For my application
> >> it would be nice if hpricot behaved like a browser.
>
> Paul,
>
> before I address your response directly I will say that I am aware of
> your crusade against html parsing libraries and while I believe you
> are entitled to your opinion, I disagree with it. I have done enough
> of this sort of thing to know that, for me, the level of abstraction
> that these libraries give is beneficial in both development time and
> maintenance. I am neither an html nuby, nor a ruby nuby. I am also
> aware that my needs may not match those of someone else so I'm not
> going to ram my opinions down their throat every time they ask for a
> little help.
>
>
> > You have created a new thread, and you have not attached any prior
> > text. This requires us to start over.
>
> As this is the first time I have posted on this subject, that much is
> obvious. Unless I am missing something.
>
>
> > Tell us what you hoped would happen, what happened instead, and how
> > they differ.
>
> Run the script and that too will be obvious.
>
>
> > If your goal is to filter particular content from HTML pages, just
> > say so, and be specific about what you want and don't want. Given
> > this information, I will show you how to extract the desired content
> > with a few lines of Ruby, no fuss, no undue complexity, no Hpricot.
>
> My goal is to highlight an issue I found with a particular library
> and provide some sample code that shows the problem with the minimum
> amount of code. I posted it here so that there may be some discussion
> with interested people as to the desired behaviour.
>
>
> > IIRC, you had asked for help using Hpricot to extract text between
> > <p> and </p> tag pairs, but with the added requirement that there be
> > an IMG tag within the <p> ... </p> tag pair to validate the case. Is
> > this still the goal? If so, how did my previously posted, simple
> > solution work out for you?
>
> What IMG tag? There isn't one in the sample code. What previous
> solution? You do not recall correctly.
>
>
> > This is a scene in a much larger play, one in which someone says,
> > "Wow, I had no idea there was such a powerful library, so carefully
> > designed, so complete. But, notwithstanding its extraordinary
> > features, notwithstanding the hundreds of man-hours expended creating
> > it ... I can't get it to do what I want."
>
> The incident that prompted my post went thus...
> I had a page that seemed to render fine in a browser but when parsing
> it my code failed. I inspected the html and found a malformed comment
> to be the problem. Probably put there to stop screen scraping. I
> wrote a bit of code, using regexps no less, that removed the
> offending comment and hpricot then went on its merry way. Job done.
> I thought others may be interested so I posted some sample code. I am
> now regretting that decision.
>
>
> > This is a very common refrain. I think I can solve your problem with
> > a few lines of Ruby code, code that you can easily understand and
> > adapt to specific and evolving requirements. And if I cannot do this,
> > I will say so.
>
> I could too, but I don't care.
>
>
> > --
> > Paul Lutus
>
> Thanks for hijacking my thread. Thanks for nothing.
>
>
>
--
Chris Carter
concentrationstudios.com
brynmawrcs.com