Asp Forum - hpricot selective text modification

Siddharth Karandikar

12/13/2006 10:17:00 AM

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-t...
is an answer to most of my requirements, except one.

How can I do a selective traverse_text so that I can skip text of
specific tags?

One option was to use parent.name while traversing over text.
Here is the code that I tried,

require 'hpricot'
class Hpricot::Text
  def set(string)
    @content = string
    self.raw_string = string
  end
end

s = <<HTML
<html>
<body>
<h4>Abcd</h4>
<java>this is in java1</java>
<ul>
<li>aabbcc</li>
<li>mmnnoo</li>
<li><java>this is in java2</java></li>
</ul>
<java>this is in java3</java>
</body>
</html>
HTML

index = Hpricot.parse(s)
index.traverse_text { |text|
  t = text.to_s.strip
  if text.parent and text.parent.name and text.parent.name != 'java' and
      not t.empty?
    t = "=#{t}="
    text.set(t)
    puts "Modified text to:#{t}"
  end
}
puts index

I get the following error:

Modified text to:=Abcd=
Modified text to:=aabbcc=
Modified text to:=mmnnoo=
hpricot-test1.rb:30: undefined method `name' for #<Hpricot::Doc:0x2e49c18> (NoMethodError)
    from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:377:in `traverse_text_internal'
    from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:366:in `traverse_text_internal'
    from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:146:in `each'
    from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:146:in `each_child'
    from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:366:in `traverse_text_internal'
    from c:/ruby/lib/ruby/gems/1.8/gems/hpricot-0.4-mswin32/lib/hpricot/traverse.rb:358:in `traverse_text'
    from hpricot-test1.rb:28

Am I making any mistake?

I am new to the world of Ruby and Hpricot ... so please bear with me.

- Siddharth

5 Answers

Paul Lutus

12/13/2006 10:34:00 AM

Siddharth Karandikar wrote:

> http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-t...
> is an answer to most of my requirements, except one.
>
> How can I do a selective traverse_text so that I can skip text of
> specific tags?

/ ... snip lengthy listing of Hpricot error messages

> Am I making any mistake?

Rather than describe the problems you are having trying to make Hpricot
deliver a particular result, why not say what you are trying to accomplish
and we can discuss that instead?

Parsing and extracting particular text from syntactically correct HTML pages
is relatively easy. It only requires a few lines of Ruby code. You can
choose which tags to extract text from, and leave all the others behind.

In some cases, it is simpler to write your own extraction code than to try
to get a library to do this for you. But this approach requires that the
HTML pages be reasonably error-free -- it doesn't work very well if there
are errors in the syntax of the source pages.

If the pages you have to parse are reasonably error-free, you may have a
much easier time getting what you are after than you may think at this
point.
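A minimal sketch of the hand-rolled extraction described above, assuming well-formed markup in which the chosen tags are not nested inside themselves. The helper name extract_text and the sample string are made up for illustration:

```ruby
# Hypothetical helper: collect the text inside every <tag>...</tag> pair
# for the given tag names. Assumes well-formed, non-nested tags.
def extract_text(html, *tags)
  names = tags.map { |t| Regexp.escape(t) }.join('|')
  pattern = %r{<(#{names})\b[^>]*>(.*?)</\1>}im
  html.scan(pattern).map { |_tag, text| text.strip }
end

sample = "<h4>Abcd</h4><java>this is in java1</java><li>aabbcc</li>"
p extract_text(sample, 'h4', 'li')   # => ["Abcd", "aabbcc"]
```

A pattern like this breaks down quickly on malformed or deeply nested pages, which is the limitation mentioned above.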

--
Paul Lutus
http://www.ara...

Siddharth Karandikar

12/13/2006 10:52:00 AM

Here is the scenario,

I am trying to publish my blog in two languages: English and my native
language, Marathi. The posts will be written in plain text, and I am
using BlueCloth to generate the required HTML markup.
I have hacked BlueCloth to emit <english>...</english> in the required
places,

e.g.
### <E title E>

will generate
<h3><english>title</english></h3>

Now, given this kind of HTML, I would like to skip all the text
under 'english' tags and convert all the remaining text to my language,
Marathi (UTF-8 codes). I am using Hpricot for this.

After that I am thinking of removing all the 'english' tags but keeping
the markup surrounding them.

- Siddharth


Peter Szinek

12/13/2006 10:56:00 AM

>
> Am I making any mistake?
Sure :).

Some W3C DOM theory:

A document consists of different Nodes - in practice subclasses of Node:
Element, Document, Attribute, Comment, Text, ProcessingInstruction etc.
(just off the top of my head - there are some more, like
DocumentFragment, CData, ... but it is unlikely you will need them
here). Not every Node has a name, or children, or a parent, or xxx. You
have to make sure that the subclass of Node you are talking to
actually responds to the method you are trying to send it.

A Hpricot DOM is not exactly a W3C DOM, but it is mostly similar:

Not every node has a name. Hpricot::Elem does, but Hpricot::Doc (as in
your example), Hpricot::Text and Hpricot::Comment do not. Likewise, only
container nodes such as Hpricot::Elem have children; Hpricot::Text and
Hpricot::Comment do not. Also, a Hpricot::Doc does not have a parent,
I think.

Your problem is that text.parent can be the Document node (Hpricot::Doc),
which does not have a name method.

So you have to modify your code like this:

if text.parent and text.parent.name and text.parent.name != 'java' and

to

parent = text.parent
# or with respond_to?(:name), or with parent.parent == nil
if parent.instance_of? Hpricot::Elem
  # do the stuff
else
  # you have reached the top node - the Doc; nothing to do
end

(The else branch is not needed, I just added it for illustration)
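The same guard, sketched with stand-in classes so that it runs without Hpricot installed. Doc, Elem and Text here are hypothetical Structs that only mimic the relevant shape (a Doc with no name method); they are not Hpricot's own classes:

```ruby
# Stand-ins only: Doc deliberately has no `name`, like Hpricot::Doc in the trace.
Doc  = Struct.new(:children)
Elem = Struct.new(:name, :parent)
Text = Struct.new(:content, :parent)

# True when the text node sits directly under a named node other than `skip`.
def modifiable?(text, skip = 'java')
  parent = text.parent
  parent.respond_to?(:name) && parent.name != skip
end

doc   = Doc.new([])
h4    = Elem.new('h4', doc)
java  = Elem.new('java', doc)
texts = [Text.new('Abcd', h4), Text.new('in java', java), Text.new("\n", doc)]

results = texts.map { |t| modifiable?(t) }
p results   # => [true, false, false]
```

Using respond_to? rather than instance_of? keeps the check working for any node type that happens to have a name, at the cost of being less explicit about which class is expected.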

HTH,
Peter

__
http://www.rubyra...

Siddharth Karandikar

12/13/2006 11:12:00 AM

Thanks Peter.
I need to improve my knowledge of the DOM in general.

I have modified the code to use "if p.instance_of? Hpricot::Elem and
....."
Right now, it's working fine for me. I still need to think about all the
possible cases.

Thanks,
Siddharth


Paul Lutus

12/13/2006 11:20:00 AM

Siddharth Karandikar wrote:

> Here is the scenario,
>
> I am trying to publish my blog in two languages: English and my native
> language, Marathi. The posts will be written in plain text, and I am
> using BlueCloth to generate the required HTML markup.
> I have hacked BlueCloth to emit <english>...</english> in the required
> places,
>
> e.g.
> ### <E title E>
>
> will generate
> <h3><english>title</english></h3>
>
> Now, given this kind of HTML, I would like to skip all the text
> under 'english' tags and convert all the remaining text to my language,
> Marathi (UTF-8 codes). I am using Hpricot for this.

Okay, that sounds a great deal more complex than a typical text extraction
task from an HTML page. I assume you mean to preserve some parts unchanged,
while translating other parts, and reassemble the page at the end of the
process.

This could be done using your own custom code, but only if a much more
specific, detailed description were offered. The same thing could be said
of an Hpricot-based approach, by the way.
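As a rough sketch of what such custom code might look like, assuming well-formed markup and no nested <english> tags. The convert method is only a placeholder (it upcases) standing in for the real English-to-Marathi conversion, and both method names are made up here:

```ruby
# Placeholder for the real English-to-Marathi conversion (an assumption).
def convert(text)
  text.upcase
end

# Split into <english> blocks, other tags, and plain text, converting only
# the plain text. Assumes well-formed markup and no nested <english> tags.
def convert_outside_english(html)
  html.split(%r{(<english>.*?</english>|<[^>]+>)}im).map do |piece|
    piece.start_with?('<') ? piece : convert(piece)
  end.join
end

html = "<h3><english>title</english></h3><p>some text</p>"
puts convert_outside_english(html)
# prints <h3><english>title</english></h3><p>SOME TEXT</p>
```

Because the capture group keeps the delimiters in the split result, the page is reassembled simply by joining the pieces back together.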

> After that I am thinking of removing all the 'english' tags but keeping
> the markup surrounding them.

Okay, that part is easy:

data.gsub!(%r{<english>.*?</english>}im,"")
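Note that the pattern above also deletes the text between the tags. If the goal is to drop only the <english> markers and keep their content, a variant that matches just the tags might look like this (the sample string in data is made up here):

```ruby
data = "<h3><english>title</english></h3>"
# Match only the opening and closing <english> tags, not what lies between.
stripped = data.gsub(%r{</?english>}i, "")
puts stripped   # prints <h3>title</h3>
```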

Most tasks in this class are easy to accomplish, as long as the description
is clear and detailed enough.

--
Paul Lutus
http://www.ara...

comp.lang.ruby
