Asp Forum - Need a regex searching html code

Chirantan

2/28/2008 6:36:00 AM

I have some HTML in a string. I want to retrieve the content (which
can be any HTML with any number of tags) inside a div, from after
the heading to the end of the div.

Example,

<div class="info">
<h5>Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

<div class="info">
<h5>Plot Outline:</h5>
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a class="tn15more
inline" href="http://www.imdb.com/title/tt0337978/plotsum...
onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary';">more</a>
</div>

In the above example, if "Plot Outline:" is the header I am looking
for, then the regex should give me -

John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a class="tn15more
inline" href="http://www.imdb.com/title/tt0337978/plotsum...
onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary';">more</a>

And if "Tagline:" is what I am looking for, then the regex should give me -

Yippee Ki Yay Mo - John 6:27

I hope the problem statement is clear.


Todd Benson

2/28/2008 12:08:00 PM

On Thu, Feb 28, 2008 at 12:40 AM, Chirantan <chirantan.rajhans@gmail.com> wrote:
> I have an html code into string. I want to retrieve the content (Can
> be any HTML code with any number of tags) present inside the div after
> the heading till the end of the div.
>
> Example,
>
> <div class="info">
> <h5>Tagline:</h5>
> Yippee Ki Yay Mo - John 6:27
> </div>
>
> <div class="info">
> <h5>Plot Outline:</h5>
> John McClane takes on an Internet-based terrorist organization who is
> systematically shutting down the United States. <a class="tn15more
> inline" href="http://www.imdb.com/title/tt0337978/plotsum...
> onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> link=/title/tt0337978/plotsummary';">more</a>
> </div>
>
>
> In the above example, Plot Outline is header that I am looking for
> then, regex should give me -
>
> John McClane takes on an Internet-based terrorist organization who is
> systematically shutting down the United States. <a class="tn15more
> inline" href="http://www.imdb.com/title/tt0337978/plotsum...
> onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> link=/title/tt0337978/plotsummary';">more</a>
>
>
> And if "Tagline:" is what I am looking for then regex should give me -
>
> Yippee Ki Yay Mo - John 6:27
>
> I hope the problem statement is clear.

Scraping html is not the easiest thing in the world. I would
recommend the hpricot library.

Todd

William James

2/28/2008 3:51:00 PM

On Feb 28, 12:36 am, Chirantan <chirantan.rajh...@gmail.com> wrote:
> I have an html code into string. I want to retrieve the content (Can
> be any HTML code with any number of tags) present inside the div after
> the heading till the end of the div.
>
> Example,
>
> <div class="info">
> <h5>Tagline:</h5>
> Yippee Ki Yay Mo - John 6:27
> </div>
>
> <div class="info">
> <h5>Plot Outline:</h5>
> John McClane takes on an Internet-based terrorist organization who is
> systematically shutting down the United States. <a class="tn15more
> inline" href="http://www.imdb.com/title/tt0337978/plotsum...
> onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> link=/title/tt0337978/plotsummary';">more</a>
> </div>
>
> In the above example, Plot Outline is header that I am looking for
> then, regex should give me -
>
> John McClane takes on an Internet-based terrorist organization who is
> systematically shutting down the United States. <a class="tn15more
> inline" href="http://www.imdb.com/title/tt0337978/plotsum...
> onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> link=/title/tt0337978/plotsummary';">more</a>
>
> And if "Tagline:" is what I am looking for then regex should give me -
>
> Yippee Ki Yay Mo - John 6:27
>
> I hope the problem statement is clear.

Note that this will give spurious results if an html comment happens
to contain what you are looking for.

def find_header header, html
  # Put the contents of all of the DIVs in an array.
  divs = html.scan( %r{<div.*?>(.*?)</div>}im ).flatten
  divs.each{|s|
    # Escape the header so characters like "?" or "(" are matched literally.
    if s =~ %r{<h(\d)>#{Regexp.escape(header)}</h\1>(.*)}im
      return $2.strip
    end
  }
  return nil
end

html = DATA.read

puts find_header( "Plot Outline:", html )

__END__
<div class="info">
<h5>Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

<div class="info">
<h5>Plot Outline:</h5>
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a class="tn15more
inline" href="http://www.imdb.com/title/tt0337978/plotsum...
onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary';">more</a>
</div>
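
The caveat about HTML comments can be reduced by removing comments before scanning. The following is only a sketch under simplifying assumptions (comments are plain `<!-- ... -->` spans and do not appear inside scripts or attribute values); the helper names `strip_html_comments` and `find_header_ignoring_comments` are hypothetical and not part of the original answer.

```ruby
# Hypothetical helper: drop <!-- ... --> spans, including multi-line ones.
def strip_html_comments(html)
  html.gsub(/<!--.*?-->/m, "")
end

# Same idea as find_header, but run on the comment-free text.
def find_header_ignoring_comments(header, html)
  cleaned = strip_html_comments(html)
  cleaned.scan(%r{<div.*?>(.*?)</div>}im).flatten.each do |body|
    # Regexp.escape keeps characters such as "?" in the header literal.
    return $1.strip if body =~ %r{<h\d>#{Regexp.escape(header)}</h\d>(.*)}im
  end
  nil
end
```

This still inherits the other limits of the regex approach, such as nested divs.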

mrt

2/28/2008 6:54:00 PM

A regex will break too easily when parsing HTML. A real parser will do
a much better job, and often be more concise and readable, too.

This does what you want:

#-------
require 'rubygems'
require 'hpricot'
@doc = Hpricot(html) # or Hpricot(open("filename"))

def find(term)
  @doc.search("//div[@class='info']").each do |info|
    header = info.search("h5").remove
    if header.inner_text == term
      puts info.inner_html
    end
  end
end
#-------

> find("Plot Outline:")
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a href="http://
www.imdb.com/title/tt0337978/plotsummary" class="tn15more
inline" onclick="(new Image()).src='/rg/title-tease/plotsummary/images/
b.gif?
link=/title/tt0337978/plotsummary';">more</a>

Mark

William James

2/28/2008 8:15:00 PM

On Feb 28, 9:50 am, William James <w_a_x_...@yahoo.com> wrote:
> On Feb 28, 12:36 am, Chirantan <chirantan.rajh...@gmail.com> wrote:
>
> > I have an html code into string. I want to retrieve the content (Can
> > be any HTML code with any number of tags) present inside the div after
> > the heading till the end of the div.
>
> > Example,
>
> > <div class="info">
> > <h5>Tagline:</h5>
> > Yippee Ki Yay Mo - John 6:27
> > </div>
>
> > <div class="info">
> > <h5>Plot Outline:</h5>
> > John McClane takes on an Internet-based terrorist organization who is
> > systematically shutting down the United States. <a class="tn15more
> > inline" href="http://www.imdb.com/title/tt0337978/plotsum...
> > onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> > link=/title/tt0337978/plotsummary';">more</a>
> > </div>
>
> > In the above example, Plot Outline is header that I am looking for
> > then, regex should give me -
>
> > John McClane takes on an Internet-based terrorist organization who is
> > systematically shutting down the United States. <a class="tn15more
> > inline" href="http://www.imdb.com/title/tt0337978/plotsum...
> > onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> > link=/title/tt0337978/plotsummary';">more</a>
>
> > And if "Tagline:" is what I am looking for then regex should give me -
>
> > Yippee Ki Yay Mo - John 6:27
>
> > I hope the problem statement is clear.
>
> Note that this will give spurious results if an html comment happens
> to contain what you are looking for.
>
> def find_header header, html
> # Put all of the DIVs in an array.
> divs = html.scan( %r{<div.*?>(.*?)</div>}im ).flatten
> divs.each{|s|
> if s =~ %r{<h(\d)>#{header}</h\1>(.*)}im
> return $2.strip
> end
> }
> return nil
> end
>
> html = DATA.read
>
> puts find_header( "Plot Outline:", html )
>
> __END__
> <div class="info">
> <h5>Tagline:</h5>
> Yippee Ki Yay Mo - John 6:27
> </div>
>
> <div class="info">
> <h5>Plot Outline:</h5>
> John McClane takes on an Internet-based terrorist organization who is
> systematically shutting down the United States. <a class="tn15more
> inline" href="http://www.imdb.com/title/tt0337978/plotsum...
> onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> link=/title/tt0337978/plotsummary';">more</a>
> </div>

More concise:

def find_header header, html
  html.scan( %r{<div.*?>(.*?)</div>}im ).flatten.each{|s|
    return $1.strip if s =~ %r{<h5>#{header}</h5>(.*)}im }
  return nil
end

Chirantan

2/29/2008 4:00:00 AM

On Feb 29, 1:14 am, William James <w_a_x_...@yahoo.com> wrote:
> On Feb 28, 9:50 am, William James <w_a_x_...@yahoo.com> wrote:
>
>
>
> > On Feb 28, 12:36 am, Chirantan <chirantan.rajh...@gmail.com> wrote:
>
> > > I have an html code into string. I want to retrieve the content (Can
> > > be any HTML code with any number of tags) present inside the div after
> > > the heading till the end of the div.
>
> > > Example,
>
> > > <div class="info">
> > > <h5>Tagline:</h5>
> > > Yippee Ki Yay Mo - John 6:27
> > > </div>
>
> > > <div class="info">
> > > <h5>Plot Outline:</h5>
> > > John McClane takes on an Internet-based terrorist organization who is
> > > systematically shutting down the United States. <a class="tn15more
> > > inline" href="http://www.imdb.com/title/tt0337978/plotsum...
> > > onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> > > link=/title/tt0337978/plotsummary';">more</a>
> > > </div>
>
> > > In the above example, Plot Outline is header that I am looking for
> > > then, regex should give me -
>
> > > John McClane takes on an Internet-based terrorist organization who is
> > > systematically shutting down the United States. <a class="tn15more
> > > inline" href="http://www.imdb.com/title/tt0337978/plotsum...
> > > onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> > > link=/title/tt0337978/plotsummary';">more</a>
>
> > > And if "Tagline:" is what I am looking for then regex should give me -
>
> > > Yippee Ki Yay Mo - John 6:27
>
> > > I hope the problem statement is clear.
>
> > Note that this will give spurious results if an html comment happens
> > to contain what you are looking for.
>
> > def find_header header, html
> > # Put all of the DIVs in an array.
> > divs = html.scan( %r{<div.*?>(.*?)</div>}im ).flatten
> > divs.each{|s|
> > if s =~ %r{<h(\d)>#{header}</h\1>(.*)}im
> > return $2.strip
> > end
> > }
> > return nil
> > end
>
> > html = DATA.read
>
> > puts find_header( "Plot Outline:", html )
>
> > __END__
> > <div class="info">
> > <h5>Tagline:</h5>
> > Yippee Ki Yay Mo - John 6:27
> > </div>
>
> > <div class="info">
> > <h5>Plot Outline:</h5>
> > John McClane takes on an Internet-based terrorist organization who is
> > systematically shutting down the United States. <a class="tn15more
> > inline" href="http://www.imdb.com/title/tt0337978/plotsum...
> > onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
> > link=/title/tt0337978/plotsummary';">more</a>
> > </div>
>
> More concise:
>
> def find_header header, html
> html.scan( %r{<div.*?>(.*?)</div>}im ).flatten.each{|s|
> return $1.strip if s =~ %r{<h5>#{header}</h5(.*)}im }
> return nil
> end

Thank you William and Mark,

The codes worked. :-) Thanks a lot.

mrt

2/29/2008 1:50:00 PM

All the regex solutions provided will break with the following
perfectly valid HTML:

<div class="info">
<h5 >Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

This is one of many reasons it is a BAD idea to use regexes to parse
HTML. Regular expressions are simply not the right tool for the job.
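
To make the failure concrete, here is a small sketch that applies the strict `<h5>` pattern from the regex answers to both spellings of the tag. The method name `strict_h5_match?` is made up for this illustration.

```ruby
# The header pattern used in the regex answers, fixed to "Tagline:".
def strict_h5_match?(html)
  !(html =~ %r{<h5>Tagline:</h5>(.*)}im).nil?
end

plain  = "<div class=\"info\">\n<h5>Tagline:</h5>\nYippee Ki Yay Mo - John 6:27\n</div>"
spaced = plain.sub("<h5>", "<h5 >")  # same markup, one space before ">"

puts strict_h5_match?(plain)   # prints true
puts strict_h5_match?(spaced)  # prints false
```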

Florian Gilcher

2/29/2008 4:53:00 PM

On Feb 29, 2008, at 2:54 PM, Mark Thomas wrote:

> All the regex solutions provided will break with the following
> perfectly valid HTML:
>
> <div class="info">
> <h5 >Tagline:</h5>
> Yippee Ki Yay Mo - John 6:27
> </div>
>
> This is one of many reasons it is a BAD idea to use regexes to parse
> HTML. Regular expressions are simply not the right tool for the job.
>

What's quite interesting is that I am not able to find a nice article
on _why_ this doesn't work. So, in short:

Regexps can only parse languages that are regular (hence the name) or,
in other words, Type 3 languages in the Chomsky hierarchy [1]. This is
a rule of thumb, because many regexp libraries nowadays implement
features that let you do more than formal regular expressions can.
But for typical use, it holds.

Regular languages have no way to "look behind": a recognizer for them
reads the input front to back with only a fixed, finite amount of
memory, so it cannot keep count of how many tags are currently open.
This is why you cannot define a regular language (and thus no regular
expression) that describes and parses an arbitrarily deeply nested
structure: you have no way to determine which closing tag matches a
given opening tag.

A more abstract example:
There is no (formal) regular expression that matches exactly the words
that consist of n times "a" followed by n times "b":

ab
aabb
aaabbb
aaaabbbb
etc.
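
As an illustration of the "more than formal regular expressions" caveat: the regex engine in Ruby 1.9 and later supports recursive subexpression calls with `\g<name>`, which goes beyond regular languages. The sketch below assumes such a Ruby version (it would not have run on the Ruby 1.8 of this thread's era), and the method name `an_bn?` is hypothetical.

```ruby
# Matches n times "a" followed by n times "b" (n >= 1) by letting the
# named group "ab" call itself between the outer "a" and "b".
def an_bn?(word)
  !(word =~ /\A(?<ab>a\g<ab>b|ab)\z/).nil?
end

%w[ab aabb aaabbb aaabb xx].each do |w|
  puts "#{w}: #{an_bn?(w)}"
end
```

The first three words are reported as matching and the last two as not matching.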

What you can do is extract a tag, push it on a stack, extract the
next one, and so on, popping tags off when you encounter the matching
closing tags. Tags by themselves can be described with regexps (afaik,
this is how TextMate does its markup).
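
A minimal sketch of that stack idea, assuming very simple markup: no comments, no void elements written without a slash (such as a bare `<br>`), and no ">" inside attribute values. The method name `balanced_tags?` is hypothetical; a regexp only picks out individual tags, and the stack does the matching.

```ruby
# Returns true if every opening tag is closed by a matching tag in the
# right order. The regexp only tokenizes; nesting is tracked on the stack.
def balanced_tags?(html)
  stack = []
  html.scan(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>}) do |closing, name, self_closing|
    next unless self_closing.empty?   # "<br/>" opens and closes at once
    if closing.empty?
      stack.push(name.downcase)       # opening tag: remember it
    else
      # closing tag: it must match the most recently opened tag
      return false unless stack.pop == name.downcase
    end
  end
  stack.empty?                        # anything left was never closed
end

puts balanced_tags?("<div><h5 >Tagline:</h5>text</div>")
puts balanced_tags?("<div><h5>Tagline:</div></h5>")
```

The first call reports the markup as balanced and the second reports it as unbalanced, because `</div>` arrives while `h5` is still on top of the stack.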

Greetings
Skade

[1] http://en.wikipedia.org/wiki/Chomsky...

Todd Benson

2/29/2008 5:36:00 PM

On Fri, Feb 29, 2008 at 10:52 AM, Florian Gilcher <flo@andersground.net> wrote:
>
>
> On Feb 29, 2008, at 2:54 PM, Mark Thomas wrote:
>
> > All the regex solutions provided will break with the following
> > perfectly valid HTML:
> >
> > <div class="info">
> > <h5 >Tagline:</h5>
> > Yippee Ki Yay Mo - John 6:27
> > </div>
> >
> > This is one of many reasons it is a BAD idea to use regexes to parse
> > HTML. Regular expressions are simply not the right tool for the job.
> >
>
> Whats quite interesting is that I am not able to find a nice article
> on _why_
> this doesn't work. So, in short:
>
> Regexp can only parse languages that are regular (hence the name) or -
> in other words - a Type 3-language in the Chomsky hierarchy [1]. This
> is a
> rule of thumb because many Regexp-libraries nowadays implement
> features that enable you to do more than formal regular expressions.
> But for the typical use, it is true.
>
> Regular languages do not have any possibility to "look behind". They
> do only
> look forward. This is the reason why you cannot define a regular
> language to
> describe an parse arbitrarily deep nested structure (an thus, no regular
> expression):
> You have no possibility to determine which closing tag matches a given
> opening tag.
>
> A more abstract example:
> There is no (formal) regular expression that matches a word that
> consists
> of n times "a" and then n times "b":
>
> ab
> aabb
> aaabbb
> aaaabbbb
> etc.
>
> What you can do is extract a tag, push it on a stack, extract the
> next one, etc. and pop them when encountering matching closing tags.
> Tags
> by itself can be described with regexps (afaik, this is how Textmate
> does its
> markup).
>
> Greetings
> Skade
>
> [1] http://en.wikipedia.org/wiki/Chomsky...

Thank you for that great explanation! I was waiting for someone to
bring up formal grammar, but I was afraid to, because I wasn't sure it
applied (not that familiar with how regexps actually work).

Todd

William James

2/29/2008 7:04:00 PM

On Feb 29, 7:50 am, Mark Thomas <m...@thomaszone.com> wrote:
> All the regex solutions provided will break with the following
> perfectly valid HTML:
>
> <div class="info">
> <h5 >Tagline:</h5>
> Yippee Ki Yay Mo - John 6:27
> </div>

Easily fixed.

def find_header header, html
  html.scan( %r{<div.*?>(.*?)</div\s*>}im ).flatten.
  each{|s|
    return $1.strip if s =~ %r{<h5\s*>#{header}</h5\s*>(.*)}im }
  return nil
end

>
> This is one of many reasons it is a BAD idea to use regexes to parse
> HTML. Regular expressions are simply not the right tool for the job.

Who told you that they are not? And why did you take his word for it?
Does hpricot use regular expressions?

William James

2/29/2008 7:14:00 PM

On Feb 29, 10:52 am, Florian Gilcher <f...@andersground.net> wrote:
> On Feb 29, 2008, at 2:54 PM, Mark Thomas wrote:
>
> > All the regex solutions provided will break with the following
> > perfectly valid HTML:
>
> > <div class="info">
> > <h5 >Tagline:</h5>
> > Yippee Ki Yay Mo - John 6:27
> > </div>
>
> > This is one of many reasons it is a BAD idea to use regexes to parse
> > HTML. Regular expressions are simply not the right tool for the job.
>
> Whats quite interesting is that I am not able to find a nice article
> on _why_
> this doesn't work. So, in short:
>
> Regexp can only parse languages that are regular (hence the name) or -
> in other words - a Type 3-language in the Chomsky hierarchy [1]. This
> is a
> rule of thumb because many Regexp-libraries nowadays implement
> features that enable you to do more than formal regular expressions.
> But for the typical use, it is true.
>
> Regular languages do not have any possibility to "look behind". They
> do only
> look forward. This is the reason why you cannot define a regular
> language to
> describe an parse arbitrarily deep nested structure (an thus, no regular
> expression):
> You have no possibility to determine which closing tag matches a given
> opening tag.
>
> A more abstract example:
> There is no (formal) regular expression that matches a word that
> consists
> of n times "a" and then n times "b":

And that doesn't matter much. One can use as many regular expressions
as he wishes.

>
> ab
> aabb
> aaabbb
> aaaabbbb
> etc.

%w[ab xx aabb aaabbb aaabb aaaabbbb].each{|s|
  if s.match(/^(a+)/) and s.match(/^a+b{#{$1.size}}$/)
    puts s
  else
    puts '-'
  end
}

Or one can use regular expression + code:

%w[ab xx aabb aaabbb aaabb aaaabbbb].each{|s|
  if s.match(/^(a+)(b+)$/) and $1.size == $2.size
    puts s
  else
    puts '-'
  end
}

What makes anyone think that a single regular expression
has to do all the work?

comp.lang.ruby
