Asp Forum - Hpricot - best way to parse based on comments

Jerome ---

11/20/2006 10:52:00 PM

I am trying to parse some files that contain comments like this:

<html>
<body>



images, text, etc...



Interesting text of site here.

</body>
</html>

I am wondering how to go about extracting the data within the comments
block using Hpricot. I am not aware of a way to refer to commented HTML
through CSS or XPath selectors.

Thanks for any ideas!

- Jerome

--
Posted via http://www.ruby-....

3 Answers

Keith Fahlgren

11/20/2006 11:51:00 PM

On 11/20/06, Jerome --- <jerome@tut0r.com> wrote:
> I am trying to parse some files that contain comments like this:
> ...
> I am not aware of a way to refer to commented HTML
> through CSS or XPath selectors.

The XPath comment() node test will select all comments.

For example, with xmlstarlet (the XPath follows the -m flag):
keith@devel ~ $ xml sel -t -m '//comment()' -v '.' -n simple.xml
one comment
two comment

keith@devel ~ $ cat simple.xml
<simple>
<!--one comment-->
<foo/>
<!--two comment-->
<bar/>
</simple>
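
The same comment() node test can be tried from Ruby. This is only a rough sketch using REXML rather than Hpricot, on the assumption that the input is well-formed XML. The inline document is a hypothetical stand-in for simple.xml, with the comment text taken from the command output above.

```ruby
require "rexml/document"

# Hypothetical stand-in for simple.xml; the comment text is taken
# from the xmlstarlet output shown in the post.
xml = <<~XML
  <simple>
    <!--one comment-->
    <foo/>
    <!--two comment-->
    <bar/>
  </simple>
XML

doc = REXML::Document.new(xml)

# comment() is an XPath node test that matches comment nodes;
# REXML::Comment#string returns the text between <!-- and -->.
comments = REXML::XPath.match(doc, "//comment()").map { |c| c.string.strip }

comments.each { |text| puts text }
```

REXML needs well-formed input, so real-world HTML would probably have to be tidied first.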

HTH,
Keith

Ken Bloom

11/21/2006 3:20:00 PM

On Tue, 21 Nov 2006 07:52:12 +0900, Jerome --- wrote:

> I am trying to parse some files that contain comments like this:
>
> <html>
> <body>
>
> 
>
> images, text, etc...
>
> 
>
> Interesting text of site here.
>
> </body>
> </html>
>
>
> I am wondering how to go about extracting the data within the comments
> block using Hpricot. I am not aware of a way to refer to commented HTML
> through CSS or XPath selectors.
>
> Thanks for any ideas!
>
> - Jerome
>

Why not gsub out the unwanted sections before parsing with hpricot, or
if the data you want is nested between comments, use a regexp to narrow
down the document to only the text between the comments before parsing
with hpricot?
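
Both variants of this idea might look roughly like the sketch below. The marker comments "begin content" and "end content" are made up for illustration, since the real comment text does not appear in the thread.

```ruby
# Hypothetical page; the marker comments are assumptions.
html = <<~HTML
  <html>
  <body>
  <!-- begin content -->
  <p>Interesting text of site here.</p>
  <!-- end content -->
  <p>images, text, etc...</p>
  </body>
  </html>
HTML

# Variant 1: narrow the document to the text between the two markers.
# The /m flag lets "." match newlines; ".*?" stops at the first end marker.
between = html[/<!-- begin content -->(.*?)<!-- end content -->/m, 1]

# Variant 2: gsub out the marked section and keep everything else.
without = html.gsub(/<!-- begin content -->.*?<!-- end content -->\n?/m, "")
```

Either string can then be handed to the parser in place of the full page.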

--Ken Bloom

--
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu...

Paul Lutus

11/21/2006 9:51:00 PM

Jerome --- wrote:

> I am trying to parse some files that contain comments like this:
>
> <html>
> <body>
>
> 
>
> images, text, etc...
>
> 
>
> Interesting text of site here.
>
> </body>
> </html>
>
>
> I am wondering how to go about extracting the data within the comments
> block using Hpricot.

The best and easiest way to parse this file using Hpricot with your required
specification ... is not to use Hpricot.

start_mark = ""
end_mark = ""

data = File.read(page_path)

output = data.scan(%r{#{start_mark}(.*?)#{end_mark}}m)

All done, finished, no poring over documentation, no considering rewriting
the library to get it to do what you actually want, done.
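
For a self-contained version of the scan approach, something like the following sketch may help. The marker strings and the sample page are assumptions, because the actual markers are not visible in the post. Regexp.escape is added so that any regex metacharacters in the markers are treated literally.

```ruby
# Hypothetical markers; the thread does not show the real comment text.
start_mark = "<!-- start -->"
end_mark   = "<!-- end -->"

# Hypothetical page content standing in for File.read(page_path).
data = <<~HTML
  <body>
  <!-- start -->first block<!-- end -->
  <p>images, text, etc...</p>
  <!-- start -->
  second block
  <!-- end -->
  </body>
HTML

# Regexp.escape keeps any regex metacharacters in the markers literal.
pattern = /#{Regexp.escape(start_mark)}(.*?)#{Regexp.escape(end_mark)}/m

# With a capture group, scan returns one single-element array per match,
# so flatten and strip to get plain strings.
output = data.scan(pattern).flatten.map(&:strip)
```

Note that scan with a capture group yields nested arrays, which is why the flatten call is there.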

By the way. Did I mention that inserting new data into the same page
structure is about the same level of difficulty?
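
Inserting new data between the same markers could be sketched as below, again with hypothetical marker strings and page content. The block form of sub is used so that the replacement text is not interpreted for backslash sequences.

```ruby
# Hypothetical markers and page; not taken from the thread.
start_mark = "<!-- start -->"
end_mark   = "<!-- end -->"

page = "<body>\n<!-- start -->old text<!-- end -->\n</body>\n"

# Capture both markers so they can be written back around the new content.
pattern = /(#{Regexp.escape(start_mark)}).*?(#{Regexp.escape(end_mark)})/m

# Replace only the first marked region; use gsub to replace every region.
updated = page.sub(pattern) { "#{$1}new text#{$2}" }
```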

--
Paul Lutus
http://www.ara...

comp.lang.ruby
