Asp Forum - How to extract texts from html source?

Sam Kong

5/9/2005 7:03:00 PM

Hi, all!

Quite often, when I need to read a list of web pages, I download the
HTML sources and save them in a single file such as a.html.
If they are mostly text, I open the HTML in a web browser, select all,
copy it into an editor, and save it.
I want to shorten this process.
How can I extract the text from HTML source?
I'm sure there are many parsers for it.
Which is the most convenient one?

Thanks.
Sam

13 Answers

james_b

5/9/2005 7:22:00 PM

Sam Kong wrote:
> Hi, all!
>
> Quite often, when I need to read a list of web pages, I download the
> html sources and save them in a single file like a.html.
> If they are mostly texts, I open the html using web browser, select all
> and copy it to an editor and save it.
> I want to make the process shorter.
> How can I extract the text from html source?
> I'm sure there're many parsers for it.
> What is the most convenient one?

Take a look at Michael Neumann's WWW::Mechanize

http://www.ntecs.de/blog/Blog/WWW-Mech...
http://rubyforge.org/frs/?group_id=427&relea...

Or install the gem

James

--

http://www.ru...
http://www.r...
http://catapult.rub...
http://orbjson.rub...
http://ooo4r.rub...
http://www.jame...

Brian Schröder

5/9/2005 7:37:00 PM

On 09/05/05, James Britt <james_b@neurogami.com> wrote:
> Sam Kong wrote:
> > Hi, all!
> >
> > Quite often, when I need to read a list of web pages, I download the
> > html sources and save them in a single file like a.html.
> > If they are mostly texts, I open the html using web browser, select all
> > and copy it to an editor and save it.
> > I want to make the process shorter.
> > How can I extract the text from html source?
> > I'm sure there're many parsers for it.
> > What is the most convenient one?
>
> Take a look at Michael Neumann's WWW::Mechanize
>
> http://www.ntecs.de/blog/Blog/WWW-Mech...
> http://rubyforge.org/frs/?group_id=427&relea...
>
> Or install the gem
>
> James

You don't need ruby for this:

$ apt-cache show w3m
Package: w3m
[snip]
Description: WWW browsable pager with excellent tables/frames support
w3m is a text-based World Wide Web browser with IPv6 support.
It features excellent support for tables and frames. It can be used
as a standalone file pager, too.
.
* You can follow links and/or view images in HTML.
* Internet message preview mode, you can browse HTML mail.
* You can follow links in plain text if it includes URL forms.
* With w3m-img, you can view image inline.
.
For more information,
see http://sourceforge.net/pr...

$ w3m -dump http://ruby.brian-sch...q... | head
A ruby a day!

Ruby Quiz Solutions (Amazing Mazes)

Amazing Mazes

For a full description see: (Amazing Mazes on Ruby Quiz Homepage)[http://
www.rubyquiz.com/quiz31.html]

Another graph algorithm. Create a maze that is fully connected and has only one
$

regards,

Brian

--
http://ruby.brian-sch...

multilingual _non rails_ ruby based vocabulary trainer:
http://www.vocabu... | http://www.g... | http://www.vok...
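
If the conversion still has to happen inside a Ruby program, one option is to shell out to the same renderer. The following is a minimal sketch, assuming w3m is installed and on the PATH; the method name html_to_text and its command parameter are made up for illustration. It relies on w3m's -dump option plus -T text/html, which tells w3m to treat standard input as HTML.

```ruby
require "open3"

# Pipe an HTML string through an external text-mode renderer and return
# the plain text it prints on standard output.
# Assumption: the default command requires w3m to be installed.
# "-T text/html" makes w3m interpret standard input as HTML.
def html_to_text(html, command = ["w3m", "-dump", "-T", "text/html"])
  output, status = Open3.capture2(*command, stdin_data: html)
  unless status.success?
    raise "#{command.first} exited with status #{status.exitstatus}"
  end
  output
end
```

Calling html_to_text(File.read("a.html")) would then return the rendered text as a string. The command is a parameter so that another renderer can be swapped in without touching the method.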

Sam Kong

5/9/2005 7:50:00 PM

James Britt wrote:
> Sam Kong wrote:
> > Hi, all!
> >
> > Quite often, when I need to read a list of web pages, I download the
> > html sources and save them in a single file like a.html.
> > If they are mostly texts, I open the html using web browser, select all
> > and copy it to an editor and save it.
> > I want to make the process shorter.
> > How can I extract the text from html source?
> > I'm sure there're many parsers for it.
> > What is the most convenient one?
>
>
> Take a look at Michael Neumann's WWW::Mechanize
>
> http://www.ntecs.de/blog/Blog/WWW-Mech...
> http://rubyforge.org/frs/?group_id=427&relea...
>
> Or install the gem

Thanks, James.
That looks cool.
However, it doesn't seem to have a function to extract texts from html.
(Or did I miss it?)
What I want is...

<table><tr><td>TEST</td></tr></table> => TEST

Is there a module that does this?

Regards,
Sam


Sam Kong

5/9/2005 8:01:00 PM

Brian Schröder wrote:
> On 09/05/05, James Britt <james_b@neurogami.com> wrote:
> > Sam Kong wrote:
> > > Hi, all!
> > >
> > > Quite often, when I need to read a list of web pages, I download the
> > > html sources and save them in a single file like a.html.
> > > If they are mostly texts, I open the html using web browser, select all
> > > and copy it to an editor and save it.
> > > I want to make the process shorter.
> > > How can I extract the text from html source?
> > > I'm sure there're many parsers for it.
> > > What is the most convenient one?
> >
> > Take a look at Michael Neumann's WWW::Mechanize
> >
> > http://www.ntecs.de/blog/Blog/WWW-Mech...
> > http://rubyforge.org/frs/?group_id=427&relea...
> >
> > Or install the gem
> >
> > James
>
> You don't need ruby for this:
>
> $ apt-cache show w3m
> Package: w3m
> [snip]
> Description: WWW browsable pager with excellent tables/frames support
> w3m is a text-based World Wide Web browser with IPv6 support.
> It features excellent support for tables and frames. It can be used
> as a standalone file pager, too.
> .
> * You can follow links and/or view images in HTML.
> * Internet message preview mode, you can browse HTML mail.
> * You can follow links in plain text if it includes URL forms.
> * With w3m-img, you can view image inline.
> .
> For more information,
> see http://sourceforge.net/pr...
>
> $ w3m -dump http://ruby.brian-sch...q... | head
> A ruby a day!

Oh, thanks.
I just realized that even lynx can do that.

Regards,
Sam


Tom Reilly

5/10/2005 2:08:00 AM

Several years ago, one of the members of the group offered me this
routine, which does a pretty good job of extracting the text from an
HTML page.

#--------------------------------------------------------------------
# Strip HTML Tags from Line
#--------------------------------------------------------------------

# Replace newlines with spaces, then delete everything between < and >.
def striphtml(line)
  line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
end

james_b

5/10/2005 2:50:00 AM

Sam Kong wrote:
> Thank James.
> That looks cool.
> However, it doesn't seem to have a function to extract texts from html.
> (Or did I miss it?)

No, it is a library for the (fairly) easy creation of HTML munging code.

Some coding is required, but it allows complete control (so you get just
the text of interest).

James

daz

5/10/2005 11:53:00 AM

Sam Kong wrote:
>
> [...] If they are mostly texts, I open the html using
> web browser, select all and copy it to an editor and save it.
>

Save As ... [text file].txt

- Removes all tags.
(Verified with Opera, Firefox & IE6, so I guess most browsers do this)
( e.g. test page: http://www... )

daz

Sam Kong

5/10/2005 3:56:00 PM

Yes, that's right...:)
I just want to do it all with my ruby program...hehe
Thanks anyway.

Sam

Sam Kong

5/10/2005 3:59:00 PM

Tom Reilly wrote:
> Several years ago, one of the members of the group offered me this
> routine which does a pretty good job of
> extracting the text from a html page.
>
> #--------------------------------------------------------------------
> # Strip HTML Tags from Line
> #--------------------------------------------------------------------
>
> def striphtml(line)
> line.gsub(/\n/, ' ').gsub(/<.*?>/, '')
> end

Thank you for sharing the code.
However, this code works only for a simple line, right?
When I tested it with a page of html by looping line by line, the
result was not what I expected.
Probably, I need to get a DOM parser...:-(

Sam
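
One likely reason the line-by-line loop misbehaves is that a tag, comment, or script block can start on one line and end on another, so no single line contains a complete match. A minimal sketch of a whole-document variant follows, assuming the page fits in memory as one string. The name strip_html and the small entity table are illustrative only, and regular expressions remain a rough approximation of a real HTML parser.

```ruby
# Small, deliberately incomplete table of common character entities.
ENTITIES = {
  "&amp;" => "&", "&lt;" => "<", "&gt;" => ">",
  "&quot;" => "\"", "&nbsp;" => " "
}.freeze
ENTITY_PATTERN = Regexp.union(ENTITIES.keys)

# Remove markup from a complete HTML document held in one string.
def strip_html(html)
  # /m lets "." match newlines, so multi-line comments are removed.
  text = html.gsub(/<!--.*?-->/m, " ")
  # Drop script and style elements together with their contents.
  text = text.gsub(/<(script|style)\b.*?<\/\1\s*>/mi, " ")
  # [^>] also matches newlines, so tags split across lines are removed.
  text = text.gsub(/<[^>]*>/, " ")
  # Decode the listed entities in a single pass.
  text = text.gsub(ENTITY_PATTERN, ENTITIES)
  # Collapse runs of whitespace left behind by the removed markup.
  text.gsub(/\s+/, " ").strip
end
```

With this sketch, strip_html(File.read("a.html")) processes the whole file at once. Each tag is replaced by a space, so words separated only by markup do not run together; the cost is that inline markup inside a word also introduces a space.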

Ben Giddings

5/10/2005 5:15:00 PM

On Monday 09 May 2005 15:04, Sam Kong wrote:
> Hi, all!
>
> Quite often, when I need to read a list of web pages, I download the
> html sources and save them in a single file like a.html.
> If they are mostly texts, I open the html using web browser, select all
> and copy it to an editor and save it.
> I want to make the process shorter.
> How can I extract the text from html source?
> I'm sure there're many parsers for it.
> What is the most convenient one?

You may find my HTMLTokenizer library convenient for this. To do what you
need, all you'd do is keep calling "tokenizer.getText()"

http://rubyforge.org/projects/html...

Ben
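
The pull-style idea described here (keep asking for the next piece of text until there is none) can be illustrated with Ruby's standard StringScanner. This is not the HTMLTokenizer library or its API; the class name SimpleTextTokenizer and the method next_text are hypothetical, and the sketch ignores comments, scripts, and entities.

```ruby
require "strscan"

# Hypothetical pull-style tokenizer: each call to next_text skips any
# tags and returns the next non-blank run of text, or nil at end of input.
class SimpleTextTokenizer
  def initialize(html)
    @scanner = StringScanner.new(html)
  end

  def next_text
    until @scanner.eos?
      # A complete tag: skip it and look again.
      next if @scanner.skip(/<[^>]*>/)
      # Otherwise take text up to the next "<", or a stray "<" by itself.
      chunk = (@scanner.scan(/[^<]+/) || @scanner.getch).strip
      return chunk unless chunk.empty?
    end
    nil
  end
end
```

A caller would create SimpleTextTokenizer.new(html) and loop on next_text until it returns nil, collecting or printing each chunk along the way.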

comp.lang.ruby
