
comp.lang.ruby

Dictionary of Known English ignore words

Sean Wolfe

1/20/2006 3:28:00 AM

Does anyone know of an open-source library or dictionary file of Ignore
Words for the English language? I tried googling, but I don't think I
have the right term for what I'm looking for. I want a library or a
possible dictionary that contains words that are common particles,
pronouns, and modifiers, such as "and if when but of a the this
that...". I am trying to design a keyword analyzer and was looking to
see if there was some work already done out there, possibly in the Ruby
world, that can help me eliminate these words from written content for
indexing?

Thanks.

Sean

3 Answers

Dave Burt

1/20/2006 4:06:00 AM


Sean Wolfe wrote:
> Does anyone know of an open-source library or dictionary file of Ignore
> Words for the English language? I tried googling, but I don't think I have
> the right term for what I'm looking for. I want a library or a possible
> dictionary that contains words that are common particles, pronouns, and
> modifiers, such as "and if when but of a the this that...". I am trying to
> design a keyword analyzer and was looking to see if there was some work
> already done out there, possibly in the Ruby world, that can help me
> eliminate these words from written content for indexing?

It seems to me the requirement to omit certain words is particularly
application-dependent. In some cases it may make sense to index all words.
Why not just index a typical sample, then remove words with too many hits?
(Would "too many" be words that appear in >90% of documents?)

Cheers,
Dave
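
Dave's suggestion can be sketched in Ruby roughly as follows. This is only an illustration: the method name `frequent_words`, the 0.9 default threshold (taken from his ">90%" guess), and the naive regex tokenizer are all assumptions, not part of any existing library.

```ruby
require 'set'

# Returns the set of words that occur in more than `threshold`
# (a fraction between 0 and 1) of the given documents.
# Tokenizing with /[a-z']+/ is an assumption made for this sketch.
def frequent_words(documents, threshold = 0.9)
  doc_freq = Hash.new(0)
  documents.each do |text|
    # Count each word once per document, not once per occurrence.
    text.downcase.scan(/[a-z']+/).uniq.each { |word| doc_freq[word] += 1 }
  end
  cutoff = threshold * documents.size
  doc_freq.select { |_word, count| count > cutoff }.keys.to_set
end

# Hypothetical sample corpus.
sample = [
  "The cat sat on the mat",
  "The dog chased the cat",
  "A bird flew over the house"
]
ignore = frequent_words(sample)
# Only "the" appears in more than 90% of these three documents.
```

With a real corpus the threshold would need tuning; a sample this small only shows the mechanics.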


Gene Tani

1/20/2006 4:12:00 AM



Sean Wolfe wrote:
> Does anyone know of an open-source library or dictionary file of Ignore
> Words for the English language? I tried googling, but I don't think I
> have the right term for what I'm looking for. I want a library or a
> possible dictionary that contains words that are common particles,
> pronouns, and modifiers, such as "and if when but of a the this
> that...". I am trying to design a keyword analyzer and was looking to
> see if there was some work already done out there, possibly in the Ruby
> world, that can help me eliminate these words from written content for
> indexing?
>
> Thanks.
>
> Sean

http://esl.about.com/library/vocabulary/bl1000...
Most of the papers about text indexing, LSI, etc. that I've seen use
between 50 and 500 stopwords (depending on how they're stemming terms).
There are a few text index libs in Ruby and Python; some have their own
stopword lists (I don't remember which):
http://rubyforge.org/pro...

http://raa.ruby-lang.org/proj...

http://raa.ruby-lang.org/projec...

http://www.zedshaw.com/projects/ruby_odeum/...

http://raa.ruby-lang.org/project/sim...

http://hinegardner.org...
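
Once a stopword list is in hand, applying it is simple. A minimal sketch, assuming a tiny hand-picked list (not taken from any of the libraries linked above), a hypothetical `keywords` helper, and naive regex tokenization with no stemming:

```ruby
require 'set'

# A tiny illustrative list; real lists usually hold 50 to 500 entries.
stopwords = Set.new(%w[a an and but if of that the this when])

# Returns the words of `text` that are not in `stopwords`,
# lowercased and in their original order.
def keywords(text, stopwords)
  text.downcase.scan(/[a-z']+/).reject { |word| stopwords.include?(word) }
end

result = keywords("When the parser fails, check the grammar and the input", stopwords)
# => ["parser", "fails", "check", "grammar", "input"]
```

A Set is used so that each lookup stays cheap even when the list grows to a few hundred words.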

Erik Terpstra

1/20/2006 8:56:00 AM


Sean Wolfe wrote:
> Does anyone know of an open-source library or dictionary file of Ignore
> Words for the English language? I tried googling, but I don't think I
> have the right term for what I'm looking for. I want a library or a
> possible dictionary that contains words that are common particles,
> pronouns, and modifiers, such as "and if when but of a the this
> that...". I am trying to design a keyword analyzer and was looking to
> see if there was some work already done out there, possibly in the Ruby
> world, that can help me eliminate these words from written content for
> indexing?
>
> Thanks.
>
> Sean
>

Download libots and unpack it; it should contain some files with ignore
words for several languages.

http://libots.sourc...

Cheers,

Erik.