Asp Forum - Hpricot html parsing

Dhanasekaran Vivekanandhan

12/13/2006 1:05:00 PM

Hi all,
I have the following HTML fragment.
I want to get the inner HTML of the <p> tag that contains an
<img> tag, not the content of the other <p> tag.
In the example below I want to get the result
"this is fun". I don't want the result to include "NO FUN".
How can I do this with Hpricot?

example html fragment:
----------------------


<p>
this is fun
<img src="" class="dhans"/>
</p>
<p>
NO FUN
</p>

thanks in advance,
dhanasekaran

--
Posted via http://www.ruby-....

12 Answers

Peter Szinek

12/13/2006 1:12:00 PM

I did not quite get you. You want the text of the first <p> because it
has an image?
Or what is the exact criterion to accept/reject <p>'s?

Peter

__
http://www.rubyra...

Dhanasekaran Vivekanandhan

12/13/2006 1:42:00 PM

Yes, I want the text of the first <p> because it
has an image, and I want to reject any <p> that has no image.
thanks,
Dhanasekaran


lrlebron@gmail.com

12/13/2006 1:56:00 PM

You can try something like this:

# p is assumed to be the Hpricot element for one <p>
if p.search("img").length > 0
  puts p.inner_html
end


Peter Szinek

12/13/2006 1:59:00 PM

Dhanasekaran Vivekanandhan wrote:
> yes, I want the text of the first <p> because it
> has an image. and reject if <p> has no image.
> thanks,
I see. Try this:
===============================================
require 'rubygems'
require 'hpricot'

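doc = Hpricot %q{
<p>
this is fun
<img src="" class="dhans"/>
</p>
<p>
NO FUN
</p>
<p>
fun again!
<img src=""/>
</p>
<p>
NO FUN AT ALL!
</p>
}

# All <p> elements in the document
paragraphs = doc/'p'

# Keep only the paragraphs that contain at least one <img>
good_elems = paragraphs.reject { |elem| (elem/"img").empty? }
good_elems.each { |elem| puts elem.inner_text.strip }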
===============================================

output:

************
this is fun
fun again!
************

You will need hpricot 0.4.84 because of inner_text - if you don't want
to install it (I did not experience any difficulties, so I can recommend
it) then you have to roll your own inner_text, but I guess this is not a
big problem.
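
For the "roll your own inner_text" case, here is a minimal sketch. It assumes you already have the string returned by elem.inner_html and simply removes anything that looks like a tag. The name strip_tags is made up for illustration and is not part of Hpricot, and the sketch does not decode entities such as &amp;.

```ruby
# Hypothetical stand-in for inner_text: takes an inner_html string,
# removes everything between '<' and '>' and trims the whitespace.
# Not part of Hpricot; entities such as &amp; are left untouched.
def strip_tags(inner_html)
  inner_html.gsub(/<[^>]*>/, "").strip
end

puts strip_tags(%Q{\nthis is fun\n<img src="" class="dhans"/>\n})
```

With Hpricot loaded you would call it as strip_tags(elem.inner_html) in place of elem.inner_text.strip.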

Cheers,
Peter

Paul Lutus

12/13/2006 5:17:00 PM

Dhanasekaran Vivekanandhan wrote:

> yes, I want the text of the first <p> because it
> has an image. and reject if <p> has no image.

Hpricot might be able to do this, but you can also do it on your own, and
know why the solution works.

---------------------------------------

#!/usr/bin/ruby -w

data = File.read("test.html")

# Capture the text between <p> and the <img> tag that follows it
array = data.scan(%r{<p>([^<]+?)<img .*?/>})

p array

---------------------------------------

Input text:

<p>don't want this text</p>

<p>want this text<img src=""/></p>

<p>don't want this text either</p>

<p>want this text too<img src=""/></p>

Output:

[["want this text"], ["want this text too"]]
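
The same idea can be sketched without test.html, with the input inlined and optional whitespace allowed around the text, so that it also copes with the layout of the fragment in the original question. The method name and the sample markup are assumptions for illustration. Like any regex approach it only handles plain text directly followed by the <img>, with no nested tags and no attributes on the <p>.

```ruby
# Returns the text of every <p> whose text is directly followed by an <img>.
# Assumes no attributes on <p> and no other tags between <p> and <img>.
def texts_before_images(html)
  # <p>, optional whitespace, a run of non-tag text, optional whitespace, <img ...>
  html.scan(%r{<p>\s*([^<]+?)\s*<img\b[^>]*>}).flatten
end

html = <<~HTML
  <p>
  this is fun
  <img src="" class="dhans"/>
  </p>
  <p>
  NO FUN
  </p>
  <p>want this text too<img src=""/></p>
HTML

p texts_before_images(html)
# prints ["this is fun", "want this text too"]
```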

--
Paul Lutus
http://www.ara...

David Vallner

12/13/2006 11:28:00 PM

Peter Szinek wrote:
> paragraphs = doc/'p'
>
> good_elems = paragraphs.map.reject {|elem| ((elem/"img").empty?) }

Which once again makes me wish paragraphs = doc/'//p[img]/text()'
worked. This could be doable if you asked Hpricot to provide you with
the REXML document (it's probably out of scope for the intendedly simple
XPath engine Hpricot uses natively), but unfortunately I can't for the
heck of it figure out how to make REXML accept the final /text(), even
though the parser claims to support XPath 1.0 except a few exceptions,
that one not being noted.
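
For what it's worth, a minimal sketch of the //p[img] selection using only REXML from Ruby's standard distribution. It assumes the input is well-formed XML, which real-world HTML often is not, and the method name and sample markup are made up for illustration. It selects the paragraphs with XPath and then collects their text children in Ruby instead of relying on a trailing /text() step.

```ruby
require "rexml/document"

# Returns the stripped text of every <p> element that has an <img> child.
# Assumes xml is well-formed; REXML raises a parse error otherwise.
def texts_of_paragraphs_with_images(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//p[img]").map do |para|
    # Element#texts returns only the direct text children of the element
    para.texts.map(&:value).join.strip
  end
end

xml = <<~XML
  <div>
    <p>this is fun <img src="" class="dhans"/></p>
    <p>NO FUN</p>
    <p>fun again! <img src=""/></p>
  </div>
XML

p texts_of_paragraphs_with_images(xml)
```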

David Vallner

RubyTalk@gmail.com

12/15/2006 6:55:00 PM

Ask:

http://code.whytheluckystiff.net/hpricot...

text in xpath should return a text node if present. For example:
(doc/"/html/body/div[1]/*/table[0]/tr[0]/*/b[9]/text")

Currently I am using the search and next_node:

doc.search("/html/body/div[1]/*/table[0]/tr[0]/td/b") { |x|
  @movie_plot = x.next_node.to_s.strip if x.inner_html == "Plot Outline:"
}

And receive

Author:
why
Message:

* lib/hpricot/elements.rb: added support for selecting text
nodes with text(): //p/text(), //p[a]//text(), etc.
* lib/hpricot/traverse.rb: ditto.
* lib/hpricot/tag.rb: the pathname method reports the path
fragment needed to get to this node.
* lib/hpricot/parse.rb: handle possible empty processing instruction.
http://code.whytheluckystiff.net/hpricot/ch...


David Vallner

12/16/2006 5:18:00 PM

ruby talk wrote:
> Ask:
>
> http://code.whytheluckystiff.net/hpricot...
>
> text in xpath should return a text node if present. For example:
> (doc/"/html/body/div[1]/*/table[0]/tr[0]/*/b[9]/text")
>

Well, it's 'text()' not 'text'. Luckily _why noticed.

> * lib/hpricot/elements.rb: added support for selecting text
> nodes with text(): //p/text(), //p[a]//text(), etc.

W00t ;P

Thanks for pointing this out.

David Vallner

PS: Your email address name confuses the heck out of me. Please use
something that doesn't cause a mental namespace clash?

RubyTalk@gmail.com

12/16/2006 8:16:00 PM

On 12/16/06, David Vallner <david@vallner.net> wrote:
> PS: Your email address name confuses the heck out of me. Please use
> something that doesn't cause a mental namespace clash?

Sorry, I have been archiving ruby talk at rubytalk@gmail.com since 10/14/04.

Stephen Becker IV

Dhanasekaran Vivekanandhan

12/18/2006 7:49:00 AM

Thanks Peter,
Your solution worked. I just wanted to know where I can find
documentation for the Hpricot syntax you used in your solution.

thanks,
dhanasekaran


comp.lang.ruby
