Asp Forum - parsing html - comp.lang.ruby

Martin Pfeffer

10/24/2004 5:36:00 PM

hi
My problem is that I need a file of German words, so I am trying to
create one by parsing HTML sites and writing the extracted words to a
database. My question is: what is the easiest way to extract text
from HTML pages?
thx
Martin

4 Answers

Stefan Schmiedl

10/24/2004 9:45:00 PM

On Sun, 24 Oct 2004 17:35:46 GMT,
Martin Pfeffer <udlduz@chello.at> wrote:
> hi
> My problem is that I need a file of German words, so I am trying to
> create one by parsing HTML sites and writing the extracted words to a
> database. My question is: what is the easiest way to extract text
> from HTML pages?
> thx
> Martin

there's a /usr/share/dict/ngerman on my Debian box
$ wc ngerman
308860 308860 3998536 ngerman

which tells me that the average word length is about 13 (!) letters.
Unvorstellbar!
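
A rough way to double-check that figure in Ruby, as a sketch only. Note that the third column of wc is a byte count that includes one newline per line, so 3998536 / 308860 comes to just under 13 bytes per line, or about 12 per word once the newline is subtracted. The helper name average_word_length and the three sample words are made up for illustration.

```ruby
# Average number of characters per entry in a word list (one word per line).
# Trailing newlines and blank lines are ignored.
def average_word_length(lines)
  words = lines.map(&:strip).reject(&:empty?)
  return 0.0 if words.empty?

  words.sum(&:length).fdiv(words.size)
end

# Tiny made-up sample standing in for the real dictionary file.
sample = ["Haus\n", "Unvorstellbar\n", "Wörterbuch\n"]
puts average_word_length(sample)
```

For the real list, something like average_word_length(File.readlines('/usr/share/dict/ngerman')) should work, though the file's encoding may need to be given explicitly.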

s.

Brian Schröder

10/25/2004 9:21:00 AM

If you don't mind senseless words like "img" that come from html markup:

--8<---
require 'open-uri'
# Kernel#open no longer accepts URLs as of Ruby 3.0, so use URI.open.
URI.open('http://ruby.brian-schroed...').read.scan(/[-\wöäüß]+/i)
--8<---
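
If the markup words are a nuisance, one option is to strip the tags before scanning. A minimal sketch, assuming a crude regex pass is good enough for harvesting words. visible_words is a hypothetical helper name, not part of any library, and this is a stand-in for a real HTML parser, not a replacement for one.

```ruby
# Naive tag stripper: fine for collecting words, not a real HTML parser.
# Character entities such as &ouml; are NOT decoded here and will still
# break words apart.
def visible_words(html)
  text = html
         .gsub(%r{<(script|style)\b.*?</\1>}mi, ' ') # drop script/style bodies
         .gsub(/<[^>]*>/, ' ')                       # drop the remaining tags
  text.scan(/[-\wöäüß]+/i).sort.uniq
end

# Inline sample page, so nothing has to be fetched.
html = '<html><head><style>p { color: red }</style></head>' \
       '<body><p>Guten <img src="a.png" alt="x"> Morgen</p></body></html>'
p visible_words(html)
```

The same call works on a fetched page by passing in the string returned by read.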

If you have valid xhtml:

--8<---
require 'rexml/document'
require 'open-uri'

include REXML
# URI.open replaces the bare open, which no longer accepts URLs as of Ruby 3.0.
Document.new(URI.open('http://ruby.brian-schroed...')).
  elements.to_a('//').
  map{|e| e.texts.map{|t|t.value} }.
  join(' ').
  scan(/[-\wöäüß]+/i).
  sort.
  uniq
--8<---

hth,

Brian

PS: I'm sure the text extraction with rexml can be done in a nicer/more efficient way.
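
One possibly tidier variant, as a sketch: ask REXML's XPath for the text nodes directly instead of walking every element. extract_words is a hypothetical helper name, and the inline document stands in for a fetched page. Like the version above, it only works on well-formed XHTML.

```ruby
require 'rexml/document'

# Collect the unique words from all text nodes of a well-formed XHTML string.
def extract_words(xhtml)
  doc = REXML::Document.new(xhtml)
  # '//text()' selects every text node in the document.
  texts = REXML::XPath.match(doc, '//text()').map(&:value)
  texts.join(' ').scan(/[-\wöäüß]+/i).sort.uniq
end

# Made-up sample document for illustration.
xhtml = '<html><body><p>Hallo <b>schöne</b> Welt</p><p>Hallo</p></body></html>'
p extract_words(xhtml)
```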

Ben Giddings

10/26/2004 9:13:00 PM

Martin Pfeffer wrote:
> My problem is that I need a file of German words, so I am trying to
> create one by parsing HTML sites and writing the extracted words to a
> database. My question is: what is the easiest way to extract text
> from HTML pages?

My "htmltokenizer" module (available on RAA and Rubyforge) is pretty
good at extracting text from HTML pages.

Ben

Alexander Kellett

10/27/2004 7:17:00 AM

On Wed, Oct 27, 2004 at 06:13:00AM +0900, Ben Giddings wrote:
> Martin Pfeffer wrote:
> >My problem is that I need a file of German words, so I am trying to
> >create one by parsing HTML sites and writing the extracted words to a
> >database. My question is: what is the easiest way to extract text
> >from HTML pages?
>
> My "htmltokenizer" module (available on RAA and Rubyforge) is pretty
> good at extracting text from HTML pages.

aye. it rocks. thanks for that :)

Alex
