Asp Forum - Hpricot elem index/position

henryturnerlists@googlemail.com

10/6/2008 2:19:00 PM

Hey,

Trying to find the String index of an Hpricot::Elem within its doc.
For example..

doc = Hpricot("<a>bob</a><a>james</a><a>dan</a>")
elem = doc.search("a")[1]
elem.start #=> 10 ( the first '<' of the second a tag.)

and eventually the following would be good..

elem.length #=> 12
elem.end #=> 21

Any thoughts appreciated!
Henners

7 Answers

Mark Thomas

10/6/2008 8:13:00 PM

On Oct 6, 10:19 am, "henryturnerli...@googlemail.com"
<henryturnerli...@googlemail.com> wrote:
> Hey,
>
> Trying to find the String index of an Hpricot::Elem within its doc.
> For example..
>
> doc = Hpricot("<a>bob</a><a>james</a><a>dan</a>")
> elem = doc.search("a")[1]
> elem.start #=> 10 ( the first '<' of the second a tag.)
>
> and eventually the following would be good..
>
> elem.length #=> 12
> elem.end #=> 21
>
> Any thoughts appreciated!
> Henners

My first thought is: Why do you want that information? Character
position is meaningless in an XML or HTML DOM. Whitespace can change
character positions without affecting the DOM at all.

-- Mark.

henryturnerlists@googlemail.com

10/7/2008 7:58:00 AM

Hi Mark,

I'm writing a broken link reporting type tool. When I find a dodgy tag
I'd like to be able to relay the character position and or line number
to the user. Useful for debugging.

Thanks -h

Mark Thomas

10/7/2008 1:52:00 PM

On Oct 7, 3:58 am, "henryturnerli...@googlemail.com"
<henryturnerli...@googlemail.com> wrote:
> Hi Mark,
>
> I'm writing a broken link reporting type tool. When I find a dodgy tag
> I'd like to be able to relay the character position and or line number
> to the user. Useful for debugging.

So, are you really interested in broken *links* (as in a GET does not
return a 200 result code) or broken HTML? I have done the former via
AJAX (jQuery sends links to a backend rails action, and if it is
broken changes the class of the link to display a red background). The
latter may be able to be done with libxml, which reports the character
position of broken input.

-- Mark.

henryturnerlists@googlemail.com

10/7/2008 2:28:00 PM

Well, I suppose there are incorrectly formatted links too... I was
talking about correctly formatted links that point to a 400+ status
code resource. Something libxml would not pick up since I guess you're
talking about its syntax checking bit.

Since the entire document is accessible from the Hpricot::Elem it
seems plausible to count the characters up to and after the element. A
15min look at the source didn't reveal anything obvious.. Have a nasty
feeling that this type of thing would have to be done in the compiled
C section of it..
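
One way to approximate this without touching Hpricot's internals is to scan the original source string directly. What follows is only a minimal sketch: it assumes the raw HTML string is still at hand, that elements of the same name are not nested inside each other, and that a regex scan is accurate enough for reporting purposes. `tag_spans` is a hypothetical helper, not an Hpricot method.

```ruby
# Hypothetical helper (not part of Hpricot): returns the character span of
# every <name>...</name> element found in the raw source string.
# Assumes elements of the same name are never nested inside each other.
def tag_spans(source, name)
  tag = Regexp.escape(name)
  pattern = /<#{tag}\b[^>]*>.*?<\/#{tag}\s*>/mi
  spans = []
  source.scan(pattern) do
    m = Regexp.last_match
    # start and end are inclusive, 0-based character indices
    spans << { start: m.begin(0), length: m[0].length, end: m.end(0) - 1 }
  end
  spans
end

source = "<a>bob</a><a>james</a><a>dan</a>"
span = tag_spans(source, "a")[1]
# For the second <a> element: start 10, length 12, end 21
puts "start=#{span[:start]} length=#{span[:length]} end=#{span[:end]}"
```

The nth span lines up with the nth result of doc.search("a") only while the no-nesting assumption holds, so treat the numbers as a reporting aid rather than an exact mapping.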

Mark Thomas

10/7/2008 5:37:00 PM

On Oct 7, 10:28 am, "henryturnerli...@googlemail.com"
<henryturnerli...@googlemail.com> wrote:
> Well, I suppose there are incorrectly formatted links too... I was
> talking about correctly formatted links that point to a 400+ status
> code resource. Something libxml would not pick up since I guess you're
> talking about its syntax checking bit.

Well, libxml stores the line number of every element. So you can
extract all links, check them, and print out element.line_num for each
one that fails the check.

Here's some starter code:

#----------------------------------------------

require 'rubygems'
require 'xml'

XML::Parser.default_line_numbers = true

html = <<END_HTML
<html>
<head><title>test</title></head>
<body>
Here is a <a href="http://brok.en"... link.</a>
</body>
</html>
END_HTML

parser = XML::Parser.string html
doc = parser.parse

def broken?(link)
  true
end

doc.find("//a[@href]").each do |link|
  if broken?(link)
    puts "Broken link to #{link['href']} on line #{link.line_num}"
  end
end
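
The broken? stub in the starter code always returns true. A rough sketch of a real check follows, using only Ruby's standard library. The names `broken_status?` and `broken_href?` are hypothetical, and the "400 or above means broken" rule is an assumption taken from the description earlier in the thread. Only absolute http(s) URLs can be checked this way; anything else is reported as broken for simplicity.

```ruby
require 'net/http'
require 'uri'

# Treats any 4xx/5xx status as broken. This threshold is an assumption;
# a stricter tool might flag everything other than 200.
def broken_status?(code)
  code.to_i >= 400
end

# Hypothetical replacement for the broken? stub. Takes the href string
# rather than the node. Unparseable URLs, non-http(s) schemes, and
# network failures all count as broken.
def broken_href?(href)
  uri = URI.parse(href)
  return true unless uri.is_a?(URI::HTTP) && uri.host

  response = Net::HTTP.get_response(uri)
  broken_status?(response.code)
rescue URI::InvalidURIError, SocketError, SystemCallError, Timeout::Error
  true
end
```

Inside the loop this would be called as broken_href?(link['href']) in place of broken?(link).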

Mark Thomas

10/8/2008 3:00:00 AM

On Oct 7, 1:36 pm, I wrote:
> Well, libxml stores the line number of every element. So you can
> extract all links, check them, and print out element.line_num for each
> one that fails the check.

Oops, my example mistakenly used the XML parser, so replace that with
XML::HTMLParser since you are parsing HTML.

-- Mark.

henryturnerlists@googlemail.com

10/8/2008 7:05:00 PM

Thanks for the hint towards libxml-ruby! I didn't even know it
existed. Can't see anything for character position but very happy
indeed. Will have a go at implementing it myself when poss..

cheers
-h
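
Since libxml reports line numbers rather than character positions, one possible bridge is to convert between the two against the original source string. This is a sketch only: both helper names are hypothetical, and it assumes the string being measured is exactly the one that was handed to the parser.

```ruby
# Hypothetical helper: 1-based [line, column] for a 0-based character offset.
def line_and_column(source, offset)
  before = source[0, offset]
  line = before.count("\n") + 1
  last_newline = before.rindex("\n")
  column = last_newline ? offset - last_newline : offset + 1
  [line, column]
end

# Hypothetical helper: 0-based character offset at which the given
# 1-based line starts, or nil if the source has fewer lines.
def line_start_offset(source, line_num)
  offset = 0
  source.each_line.with_index(1) do |line, n|
    return offset if n == line_num
    offset += line.length
  end
  nil
end

html = "<html>\n<body>\nHere is a <a href=\"x\">link</a>\n</body>\n"
line, column = line_and_column(html, html.index("<a"))
puts "line #{line}, column #{column}"
```

Given a line number from the parser, line_start_offset narrows the search to one line, and a String#index call from that offset can then locate the tag itself.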

comp.lang.ruby
