comp.lang.ruby

Text chunking?

Claus Spitzer

3/7/2005 7:09:00 PM

Greetings! I am working on a program to extract sentences from e-mails
for parsing, and was wondering if I could get some input before I
ended up taking a brute force approach. I got as far as recursing the
mail directories and using rmail to extract the body from each
e-mail... So the last part that I would need to do is split each body
into individual sentences. The naive approach would be to split at
every '.', but I'd like to know if there was something smarter
available - I took a quick glance at RAA and the gems package list and
couldn't find anything relevant, though it is not impossible that I
might have missed something. Comments?

For the curious ones - below is a snippet of my current program
(Disclaimer: It was a 5-minute job - I apologize for any horrors in
it):

------8<----------
require "rubygems"
require "rmail"

class Chunker
  attr_accessor :path

  def initialize( path = "~/tmp/enron/maildir" )
    # store the path in an instance variable so attr_accessor :path works
    @path = File.expand_path( path )
  end

  def recurse
    # puts "Recursing #{@path}"
    Dir["#{@path}/**/*"].each do |entry|
      # skip directories, which cannot be parsed as messages
      next unless File.file?( entry )
      # skip spamassassin files (the dot is escaped to match a literal '.')
      chunk( entry ) unless entry =~ /\.SA$/
    end
  end

  # parse the file with RMail and return the message object;
  # the block form of File.open closes the file automatically
  def strip_headers( source )
    File.open( source ) { |file| RMail::Parser.read( file ) }
  end

  def chunk( source )
    message = strip_headers( source )
    puts message.body
  end
end

chunker = Chunker.new( "~/tmp/enron/maildir/allen-p/straw" )
chunker.recurse

------8<----------

$ ls ~/tmp/enron/maildir/allen-p/straw
produces the following:
.. 1. 2. 2..SA 3. 4. 5. 6. 7. 8.


4 Answers

Simon Strandgaard

3/7/2005 10:20:00 PM


On Tue, 8 Mar 2005 04:08:56 +0900, Claus Spitzer
<docboobenstein@gmail.com> wrote:
[snip]


Can you show an example of what you had in mind?


Maybe this can help you?

'ab.cd.e'.scan(/.*?(?:\.|\z)/) #-> ["ab.", "cd.", "e", ""]
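A small variation, in case the trailing empty string is unwanted: matching a run of non-period characters followed by an optional period avoids the zero-length match at the end of the string. This is only a sketch of the same naive idea, so it still breaks inside numbers such as 25.00.

```ruby
# Naive chunking without the trailing empty match:
# one or more non-period characters, then an optional period.
chunks = 'ab.cd.e'.scan(/[^.]+\.?/)
p chunks  #-> ["ab.", "cd.", "e"]
```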


--
Simon Strandgaard


Claus Spitzer

3/7/2005 11:41:00 PM


Greetings!
Example... Sure. Let's consider the following text:

-----8<-----

I highly recommend David Moore's book "The Roman Pantheon" at $25.00 - a
very thorough research into the uses and development of Roman Cement....lime
and clay/pozzolonic ash; the making and uses of lime in building. The book
covers ancient kilns, and ties it all to modern uses of cement and concrete.

-----8<-----

Ideally I would like to get an array of strings out of this, each one
being a sentence. If I split at every '.', then the first sentence
will be cut off at $25.00. I might also need the quotes escaped,
since something like
"I highly recommend David Moore's book "The Roman Pantheon" at $25.00"
could be troublesome. These sentences will then be passed to another
parser (Link Grammar), with which I then extract the
verb-subject/object relations which are _then_ used in my work. Mind
you, I don't need _every_ sentence to be perfect - These are ~3GB of
e-mails, and who knows what grammatical horrors are lurking in there.
My goal is to just be able to extract _some_ relations (the target
number lying at about 500,000).

Again, I was just wondering if something like that already existed for
Ruby, since that would save me a few days worth of a) Finding a
chunker in another language, and b) Writing a Ruby wrapper for it. But
if there isn't, then that's not the end of the world.
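One possible middle ground, sketched here as an assumption rather than an existing library: break only where sentence-ending punctuation is followed by whitespace and then a capital letter (optionally preceded by an opening quote or parenthesis). The helper name split_sentences is hypothetical, and the heuristic is deliberately rough.

```ruby
# Hypothetical helper: split text into rough sentences.
def split_sentences( text )
  # Collapse line breaks and runs of whitespace into single spaces.
  flat = text.gsub( /\s+/, " " ).strip
  # Break after . ! or ? only when whitespace and then a capital letter
  # (optionally preceded by an opening quote or parenthesis) follow.
  flat.split( /(?<=[.!?])\s+(?=["'(]?[A-Z])/ )
end

text = <<~'EOS'
  I highly recommend David Moore's book "The Roman Pantheon" at $25.00 - a
  very thorough research into the uses and development of Roman Cement....lime
  and clay/pozzolonic ash; the making and uses of lime in building. The book
  covers ancient kilns, and ties it all to modern uses of cement and concrete.
EOS

sentences = split_sentences( text )
sentences.each { |s| puts s }
```

On the sample text this yields two strings, with "$25.00" and "Cement....lime" left intact because no whitespace follows those periods. Abbreviations such as "Mr. Smith" would still be split incorrectly, so this only suits a case where losing some sentences is acceptable.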

Regards...




Cassio Pennachin

3/7/2005 11:52:00 PM


Claus,

> Again, I was just wondering if something like that already existed for
> Ruby, since that would save me a few days worth of a) Finding a
> chunker in another language, and b) Writing a Ruby wrapper for it. But
> if there isn't, then that's not the end of the world.

You may want to take a look at Lingua::Sentence, at:

http://www.pressur...

I haven't actually used this module, and sentence identification is a
bit tricky. But Chad Fowler uses it for some fun stuff:

http://www.chadfowler.com/index.cgi/Computing/Programming/Ruby/HaikuInTh...

HTH,
Cassio


Claus Spitzer

3/8/2005 1:06:00 AM


Awesome! Thanks Cassio, I'll check it out.
