Srijayanth Sridhar
5/6/2009 10:16:00 AM
[Note: parts of this message were removed to make it a legal post.]
I know your boss, or whoever it is dangling your carrots, won't let
you use Hpricot. But tell him you will use Hpricot to get properly formatted
HTML and then write a parser to parse that properly formatted HTML. Even he
can't be opposed to that (seeing as how he wants you to reinvent wheels).
That way you can get rid of your whitespace problem and deal with the cosmos
at large.
Jayanth
On Wed, May 6, 2009 at 12:35 PM, Simon Krahnke <overlord@gmx.li> wrote:
> * Arun Kumar <arunkumar@innovaturelabs.com> (07:54) wrote:
>
> > Hi,
> > Previously I posted a topic on how to strip all HTML tags and get
> > the remaining text using a regexp. Luckily I got one. This is the regexp:
> >
> > /([^>]*)(?=<[^>]*?>)/im
>
> And what do you do with this regexp?
>
> > In this case I'm able to get all the data between the HTML tags. But
> > there is one small problem.
>
> Hasn't everybody told you that there are problems with parsing HTML with
> regexps?
>
> > This is the output which I get when I parse the HTML content of
> > example.com using the above regexp. Here you can see some whitespace
> > between the data (i.e. between 'Example web page' and 'You have
> > reached...'). This whitespace is generated in place of the HTML tags
> > which I avoided using the above regexp.
>
> Really? Aren't they just from all the meaningless whitespace that's in
> a typical HTML document?
>
> > I want to remove that
> > whitespace. I think that modifying the above regexp will give me the
> > right output without whitespace. Can somebody please help me?
>
> There are easy ways to strip all the whitespace, which is certainly not
> what you want, and there is a simple way to replace every run of whitespace
> with a single space (gsub(/\s+/, ' ')), which is probably also not what you
> want.
>
> Selectively removing some of the whitespace isn't easy at all, but it is
> probably a lot easier with a real HTML parser.
>
> mfg, simon .... l
>
>
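
For anyone following along, here is a rough Ruby sketch of the two points quoted above: the whitespace comes from the newlines and indentation already present between the tags, and gsub(/\s+/, ' ') collapses each run to a single space. The regexp is the one from the thread. The helper name text_chunks and the sample HTML are made up for illustration, and this is only a sketch of the regexp approach, not a replacement for a real HTML parser.

```ruby
# Regexp quoted in the thread: captures the text that precedes each tag.
TEXT_BEFORE_TAG = /([^>]*)(?=<[^>]*?>)/im

# Hypothetical helper (not from the thread): returns the text chunks found
# before each tag, with whitespace runs collapsed to one space and
# whitespace-only chunks dropped.
def text_chunks(html)
  html.scan(TEXT_BEFORE_TAG)
      .flatten
      .map { |chunk| chunk.gsub(/\s+/, ' ').strip }
      .reject(&:empty?)
end

# Made-up sample input, loosely modelled on a simple page like example.com.
SAMPLE_HTML = <<~HTML
  <html>
    <body>
      <h1>Example   web page</h1>
      <p>You have reached
         this web page.</p>
    </body>
  </html>
HTML

puts text_chunks(SAMPLE_HTML).inspect
# prints ["Example web page", "You have reached this web page."]
```

Without the map and reject steps, scan also returns chunks such as "\n    " that sit between adjacent tags, which is the "extra" whitespace in the original output. Collapsing every run this way is blunt, so whether it is what you want depends on the page.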