Asp Forum - hpricot problem

Henry Maddocks

12/17/2006 9:56:00 AM

Sorry, try again...

Not sure where to send this, sorry if it's not the right place...

The html in the attached file renders 'correctly' in the 3 browsers I
have tried but it tricks hpricot because of the second malformed
comment. When I say correctly I mean I get to see 'Some text'. I
guess it could be argued that this is incorrect. For my application
it would be nice if hpricot behaved like a browser.

Henry

10 Answers

Paul Lutus

12/17/2006 10:12:00 AM

Henry Maddocks wrote:

> Sorry, try again...
>
> Not sure where to send this, sorry if it's not the right place...
>
> The html in the attached file renders 'correctly' in the 3 browsers I
> have tried but it tricks hpricot because of the second malformed
> comment. When I say correctly I mean I get to see 'Some text'. I
> guess it could be argued that this is incorrect. For my application
> it would be nice if hpricot behaved like a browser.

You have created a new thread, and you have not attached any prior text.
This requires us to start over.

Tell us what you hoped would happen, what happened instead, and how they
differ.

If your goal is to filter particular content from HTML pages, just say so,
and be specific about what you want and don't want. Given this information,
I will show you how to extract the desired content with a few lines of
Ruby, no fuss, no undue complexity, no Hpricot.

IIRC, you had asked for help using Hpricot to extract text between <p> and
</p> tag pairs, but with the added requirement that there be an IMG tag
within the <p> ... </p> tag pair to validate the case. Is this still the
goal? If so, how did my previously posted, simple solution work out for
you?
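
As a rough illustration of the kind of short, regex-based extraction described here, a minimal sketch might look like the following. It is not the solution posted in the earlier thread, which is not reproduced on this page. The sample markup and the method name `text_from_paragraphs_with_images` are assumptions made for the example, and the approach only copes with tidy, explicitly closed, non-nested <p> ... </p> pairs.

```ruby
# Hypothetical sample input; not taken from the thread.
SAMPLE = <<~HTML
  <p><img src="a.png">First caption</p>
  <p>No image here</p>
  <p>Second <b>caption</b><img src="b.png"></p>
HTML

# Return the text of every <p> ... </p> pair that contains an <img> tag.
# Assumes paragraphs are explicitly closed and not nested.
def text_from_paragraphs_with_images(html)
  html.scan(%r{<p\b[^>]*>(.*?)</p>}mi)                 # inside of each pair
      .flatten
      .select { |inner| inner =~ /<img\b/i }           # keep those holding an IMG
      .map { |inner| inner.gsub(/<[^>]+>/, "").strip } # drop remaining tags
end

p text_from_paragraphs_with_images(SAMPLE)
# => ["First caption", "Second caption"]
```

Anything less regular than this sample (unclosed paragraphs, a ">" inside an attribute value, malformed comments) would likely defeat the pattern, which is the trade-off the rest of this thread argues about.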

This is a scene in a much larger play, one in which someone says, "Wow, I
had no idea there was such a powerful library, so carefully designed, so
complete. But, notwithstanding its extraordinary features, notwithstanding
the hundreds of man-hours expended creating it ... I can't get it to do
what I want."

This is a very common refrain. I think I can solve your problem with a few
lines of Ruby code, code that you can easily understand and adapt to
specific and evolving requirements. And if I cannot do this, I will say so.

--
Paul Lutus
http://www.ara...

Peter Szinek

12/17/2006 10:52:00 AM

Hello,

> Given this information,
> I will show you how to extract the desired content with a few lines of
> Ruby, no fuss, no undue complexity, no Hpricot.
Why should it be complicated? What fuss? Who needs few lines? With the
current version of hpricot this is exactly one line:

doc//p[img]//text()

> This is a scene in a much larger play, one in which someone says,
> "Wow, I had no idea there was such a powerful library, so carefully
> designed, so complete. But, notwithstanding its extraordinary features,
> notwithstanding the hundreds of man-hours expended creating it ... I
> can't get it to do what I want."

You know, software evolves. Three or four days ago the above was not
available in Hpricot, and since it was such a common query, requested by
many people, voilà: now it is there.

Of course there will always be some missing features - no framework or
library can solve all the problems of all mankind - but after some time,
useful feedback (i.e. not 'forget about every framework since you can do
it in a few lines of Ruby' but rather feature requests, bug reports etc)
a framework can reach a maturity level where it solves most of the
problems of its users.

Btw. ever heard of 'reinventing the wheel'?

Also your (otherwise great) code snippets always assume that the
underlying HTML is well formed, and x and y and z - which is in real
life almost never the case. Of course the posters here are not pasting
200K of HTML against which they run their production code, but a few
lines of example, which is usually an oversimplification of the problem.

This is another point where such libraries are great: they handle 844747
special cases (if your case is not among them, see the paragraph two
above this one, or add it yourself), which is always a problem
with hand-written code.

I could state here 100 other points which would prove that in
production, libraries are almost always a better choice than hand-written
code on the fly - of course learning Ruby, playing with some features
etc. is another thing. I am not arguing that in that case one should not
code everything on one's own. However, there are cases when people
need a stable, working solution for something and don't want to play
around with hand-coded regexps against crappy HTML. In this case, IMHO,
using a framework is absolutely OK.

Cheers,
Peter

__
http://www.rubyra...


Peter Szinek

12/17/2006 1:44:00 PM

> The html in the attached file renders 'correctly' in the 3 browsers I
> have tried but it tricks hpricot because of the second malformed
> comment. When I say correctly I mean I get to see 'Some text'. I guess
> it could be argued that this is incorrect.

What are you trying to do? Matching that comment? Or matching the text
'Some text'? Which version of Hpricot do you use (svn head or 0.4)? What
exactly is the problem?

> For my application it would be nice if hpricot behaved like a browser.
Well, if this is the goal, then use a browser :-). Hpricot is not a
browser and it does not try to be one.

I am working on a Java project where we use Mozilla/Firefox
XULRunner to parse the HTML (and to communicate with FF), and it is
really robust, fast and reliable. However, AFAIK
this is not doable in Ruby ATM (I would be really happy if it were,
but from what I have seen it's not - there was an initial attempt to
implement rbXPCOM, but it was abandoned in 2001). Maybe some other
browser (Safari, Opera?) could be used.

Btw. which feature of 'browser-like'-ness would you like to use? What
are your exact requirements?

Peter

__
http://www.rubyra...

Paul Lutus

12/17/2006 7:33:00 PM

Peter Szinek wrote:

/ ...

> Btw. ever heard of 'reinventing the wheel'?

I don't generally reinvent the wheel until the existing wheel breaks. This
is one of those cases.

> Also your (otherwise great) code snippets always assume that the
> underlying HTML is well formed, and x and y and z - which is in real
> life almost never the case.

Yes, true, my code is typically quite fragile and can only handle
essentially perfect HTML, and I generally offer that exact warning.
Ironically, though, in this case, my naive solution parsed the HTML that
caused Hpricot to fail.

> Of course the posters here are not pasting
> 200K of HTML against which they run their production code, but a few
> lines of example, which is usually an oversimplification of the problem.

Almost always. But in this case Hpricot failed on the provided short
example, with a single deviant tag syntax.

> This is another point where such libraries are great: they handle 844747
> special cases (if your case is not among them, see the paragraph two
> above this one, or add it yourself), which is always a problem
> with hand-written code.

Absolutely. I don't generally post my offer of a few lines of code unless
and until a library has failed. In this case, it failed.

> I could state here 100 other points which would prove that in
> production, libraries are almost always a better choice than hand-written
> code on the fly -

Yes, unfortunately none of them would successfully answer this OP's call
from the real world. Libraries are the obvious solution to this kind of
task. They have everything going for them, up to, but not including, the
moment when they fail to meet the user's requirements.

I have to say that I see a lot of posts that follow this pattern. The
library seems to be able to solve any number of difficult problems except
the specific problem the user happens to be facing.

And the simple solution I typically offer is not meant to, and cannot, stand
in for the 2^32 special cases that have been laboriously programmed into the
library. It can only meet an overlooked special need that the library
does not. It's surprising to me how often this happens.

--
Paul Lutus
http://www.ara...

_why

12/18/2006 6:17:00 AM

On Sun, Dec 17, 2006 at 06:55:48PM +0900, Henry Maddocks wrote:
> The html in the attached file renders 'correctly' in the 3 browsers I
> have tried but it tricks hpricot because of the second malformed
> comment.

Great stuff! Thank you. This is going to be a fun one to work on, so I'll get
back to you when I've got the medicine.

_why

Henry Maddocks

12/18/2006 6:17:00 AM

On 17/12/2006, at 11:15 PM, Paul Lutus wrote:

> Henry Maddocks wrote:
>
>> Sorry, try again...
>>
>> Not sure where to send this, sorry if it's not the right place...
>>
>> The html in the attached file renders 'correctly' in the 3 browsers I
>> have tried but it tricks hpricot because of the second malformed
>> comment. When I say correctly I mean I get to see 'Some text'. I
>> guess it could be argued that this is incorrect. For my application
>> it would be nice if hpricot behaved like a browser.

Paul,

before I address your response directly I will say that I am aware of
your crusade against html parsing libraries and while I believe you
are entitled to your opinion, I disagree with it. I have done enough
of this sort of thing to know that, for me, the level of abstraction
that these libraries give is beneficial in both development time and
maintenance. I am neither an html nuby, nor a ruby nuby. I am also
aware that my needs may not match those of someone else so I'm not
going to ram my opinions down their throat every time they ask for a
little help.

> You have created a new thread, and you have not attached any prior
> text.
> This requires us to start over.

As this is the first time I have posted on this subject, that much is
obvious. Unless I am missing something.

> Tell us what you hoped would happen, what happened instead, and how
> they
> differ.

Run the script and that too will be obvious.

> If your goal is to filter particular content from HTML pages, just
> say so,
> and be specific about what you want and don't want. Given this
> information,
> I will show you how to extract the desired content with a few lines of
> Ruby, no fuss, no undue complexity, no Hpricot.

My goal is to highlight an issue I found with a particular library
and provide some sample code that shows the problem with the minimum
amount of code. I posted it here so that there may be some discussion
with interested people as to the desired behaviour.

> IIRC, you had asked for help using Hpricot to extract text between
> <p> and
> </p> tag pairs, but with the added requirement that there be an IMG
> tag
> within the <p> ... </p> tag pair to validate the case. Is this
> still the
> goal? If so, how did my previously posted, simple solution work out
> for
> you?

What IMG tag? There isn't one in the sample code. What previous
solution? You do not recall correctly.

> This is a scene in a much larger play, one in which someone says,
> "Wow, I
> had no idea there was such a powerful library, so carefully
> designed, so
> complete. But, notwithstanding its extraordinary features,
> notwithstanding
> the hundreds of man-hours expended creating it ... I can't get it
> to do
> what I want."

The incident that prompted my post went thus...
I had a page that seemed to render fine in a browser but when parsing
it my code failed. I inspected the html and found a malformed comment
to be the problem. Probably put there to stop screen scraping. I
wrote a bit of code, using regexps no less, that removed the
offending comment and hpricot then went on its merry way. Job done.
I thought others may be interested so I posted some sample code. I am
now regretting that decision.
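
For readers who want to try the same workaround, here is a minimal sketch of stripping comment-like runs before handing the markup to a parser. The attached file is not reproduced on this page, so the sample input and the exact shape of the malformed comment (`<!- ... ->`) are assumptions, as is the method name `strip_comment_like`; the pattern would need adjusting to whatever the real page contains.

```ruby
# Hypothetical input: the second comment is malformed. The real attachment
# is not shown in this thread, so this shape is only an assumption.
SAMPLE_HTML = <<~HTML
  <html><body>
  <!-- a well-formed comment -->
  <!- a malformed comment ->
  <p>Some text</p>
  </body></html>
HTML

# Remove anything that opens with "<!-" and runs to the next ">".
# Deliberately crude: a well-formed comment whose body contains ">" would be
# cut short, so this suits a quick pre-parse cleanup, not general use.
def strip_comment_like(html)
  html.gsub(/<!-.*?>/m, "")
end

cleaned = strip_comment_like(SAMPLE_HTML)
puts cleaned
```

The cleaned string can then be passed to the parser in place of the raw page. A doctype declaration is left alone because the pattern requires a hyphen directly after `<!`.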

> This is a very common refrain. I think I can solve your problem
> with a few
> lines of Ruby code, code that you can easily understand and adapt to
> specific and evolving requirements. And if I cannot do this, I will
> say so.

I could too, but I don't care.

> --
> Paul Lutus

Thanks for hijacking my thread. Thanks for nothing.

Henry Maddocks

12/18/2006 6:17:00 AM

On 17/12/2006, at 11:51 PM, Peter Szinek wrote:

>> Given this information,
>> I will show you how to extract the desired content with a few
>> lines of
>> Ruby, no fuss, no undue complexity, no Hpricot.
> Why should it be complicated? What fuss? Who needs few lines?
> With the
> current version of hpricot this is exactly one line:
>
> doc//p[img]//text()

Maybe I'm going mad but there is no img tag in the sample code. I am
not interested in extracting anything. I know how to do that. I am
trying to highlight a problem I discovered in hpricot.

Paul Lutus

12/18/2006 8:40:00 AM

Henry Maddocks wrote:

>
> On 17/12/2006, at 11:15 PM, Paul Lutus wrote:
>
>> Henry Maddocks wrote:
>>
>>> Sorry, try again...
>>>
>>> Not sure where to send this, sorry if it's not the right place...
>>>
>>> The html in the attached file renders 'correctly' in the 3 browsers I
>>> have tried but it tricks hpricot because of the second malformed
>>> comment. When I say correctly I mean I get to see 'Some text'. I
>>> guess it could be argued that this is incorrect. For my application
>>> it would be nice if hpricot behaved like a browser.
>
> Paul,
>
> before I address your response directly I will say that I am aware of
> your crusade against html parsing libraries

There is no such campaign, as I have been at pains to point out. Prove your
assertion using the content of posts made to this newsgroup.

> and while I believe you
> are entitled to your opinion, I disagree with it.

You are disagreeing with your opinion, not mine.

> I have done enough
> of this sort of thing to know that, for me, the level of abstraction
> that these libraries give is beneficial in both development time and
> maintenance.

That happens to be a view I agree with, as you would know if you were to
read my posts.

/ ...

> As this is the first time I have posted on this subject, that much is
> obvious. Unless I am missing something.

So this is a new thread, with the beginning line "Sorry, try again..."?
Okay, fine. I assumed you were starting a new thread on an existing
subject, an assumption you helped along.

>> Tell us what you hoped would happen, what happened instead, and how
>> they
>> differ.
>
> Run the script and that too will be obvious.

I did. No problem unless one uses Hpricot to parse it.

/ ...

>> IIRC, you had asked for help using Hpricot to extract text between
>> <p> and
>> </p> tag pairs, but with the added requirement that there be an IMG
>> tag
>> within the <p> ... </p> tag pair to validate the case. Is this
>> still the
>> goal? If so, how did my previously posted, simple solution work out
>> for
>> you?
>
> What IMG tag? There isn't one in the sample code. What previous
> solution? You do not recall correctly.

There was a thread here last week about parsing text between <p> ... </p>
tag pairs using Hpricot, and your first line indicated a new
thread on an existing topic. I tagged my paragraph with "IIRC"; that is
adequate guidance, or it should be.

> Thanks for hijacking my thread. Thanks for nothing.

Try to aim a bit higher when you post here. This isn't IRC.

--
Paul Lutus
http://www.ara...

Henry Maddocks

12/18/2006 9:03:00 AM

On 18/12/2006, at 7:16 PM, _why wrote:

> On Sun, Dec 17, 2006 at 06:55:48PM +0900, Henry Maddocks wrote:
>> The html in the attached file renders 'correctly' in the 3 browsers I
>> have tried but it tricks hpricot because of the second malformed
>> comment.
>
> Great stuff! Thank you. This is going to be a fun one to work on,
> so I'll get
> back to you when I've got the medicine.

It's not a big deal. Like I said, it's easy to work around. Just
thought you'd like to know.

Chris Carter

12/18/2006 1:01:00 PM

Henry, there was someone just a few days ago who had a problem using
Hpricot with IMG elements in P tags. Paul must have gotten you two
confused.

On 12/18/06, Henry Maddocks <henryj@paradise.net.nz> wrote:
>
> On 17/12/2006, at 11:15 PM, Paul Lutus wrote:
>
> > Henry Maddocks wrote:
> >
> >> Sorry, try again...
> >>
> >> Not sure where to send this, sorry if it's not the right place...
> >>
> >> The html in the attached file renders 'correctly' in the 3 browsers I
> >> have tried but it tricks hpricot because of the second malformed
> >> comment. When I say correctly I mean I get to see 'Some text'. I
> >> guess it could be argued that this is incorrect. For my application
> >> it would be nice if hpricot behaved like a browser.
>
> Paul,
>
> before I address your response directly I will say that I am aware of
> your crusade against html parsing libraries and while I believe you
> are entitled to your opinion, I disagree with it. I have done enough
> of this sort of thing to know that, for me, the level of abstraction
> that these libraries give is beneficial in both development time and
> maintenance. I am neither an html nuby, nor a ruby nuby. I am also
> aware that my needs may not match those of someone else so I'm not
> going to ram my opinions down their throat every time they ask for a
> little help.
>
>
> > You have created a new thread, and you have not attached any prior
> > text.
> > This requires us to start over.
>
> As this is the first time I have posted on this subject, that much is
> obvious. Unless I am missing something.
>
>
> > Tell us what you hoped would happen, what happened instead, and how
> > they
> > differ.
>
> Run the script and that too will be obvious.
>
>
> > If your goal is to filter particular content from HTML pages, just
> > say so,
> > and be specific about what you want and don't want. Given this
> > information,
> > I will show you how to extract the desired content with a few lines of
> > Ruby, no fuss, no undue complexity, no Hpricot.
>
> My goal is to highlight an issue I found with a particular library
> and provide some sample code that shows the problem with the minimum
> amount of code. I posted it here so that there may be some discussion
> with interested people as to the desired behaviour.
>
>
> > IIRC, you had asked for help using Hpricot to extract text between
> > <p> and
> > </p> tag pairs, but with the added requirement that there be an IMG
> > tag
> > within the <p> ... </p> tag pair to validate the case. Is this
> > still the
> > goal? If so, how did my previously posted, simple solution work out
> > for
> > you?
>
> What IMG tag? There isn't one in the sample code. What previous
> solution? You do not recall correctly.
>
>
> > This is a scene in a much larger play, one in which someone says,
> > "Wow, I
> > had no idea there was such a powerful library, so carefully
> > designed, so
> > complete. But, notwithstanding its extraordinary features,
> > notwithstanding
> > the hundreds of man-hours expended creating it ... I can't get it
> > to do
> > what I want."
>
> The incident that prompted my post went thus...
> I had a page that seemed to render fine in a browser but when parsing
> it my code failed. I inspected the html and found a malformed comment
> to be the problem. Probably put there to stop screen scraping. I
> wrote a bit of code, using regexps no less, that removed the
> offending comment and hpricot then went on its merry way. Job done.
> I thought others may be interested so I posted some sample code. I am
> now regretting that decision.
>
>
> > This is a very common refrain. I think I can solve your problem
> > with a few
> > lines of Ruby code, code that you can easily understand and adapt to
> > specific and evolving requirements. And if I cannot do this, I will
> > say so.
>
> I could too, but I don't care.
>
>
> > --
> > Paul Lutus
>
> Thanks for hijacking my thread. Thanks for nothing.
>
>
>

--
Chris Carter
concentrationstudios.com
brynmawrcs.com

comp.lang.ruby
