Dave Burt
1/20/2006 4:06:00 AM
Sean Wolfe wrote:
> Does anyone know of an open-source library or dictionary file of Ignore
> Words for the English language? I tried googling, but I don't think I have
> the right term for what I'm looking for. I want a library or a possible
> dictionary that contains words that are common particles, pronouns, and
> modifiers, such as "and if when but of a the this that...". I am trying to
> design a keyword analyzer and was looking to see if there was some work
> already done out there, possibly in the Ruby world, that can help me
> eliminate these words from written content for indexing?
It seems to me the requirement to omit certain words is particularly
application-dependent. In some cases it may make sense to index all words.
Why not just index a typical sample, then remove words with too many hits?
(Would "too many" mean words that appear in more than 90% of documents?)
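
Here is a rough Ruby sketch of that idea, not a finished implementation. The method names (frequent_words, keywords), the simple regex tokenizer, and the 0.9 default threshold are all assumptions made for illustration. It counts how many sample documents each word appears in, then treats any word above the threshold as an ignore word.

```ruby
require 'set'

# Hypothetical helper: returns the words that appear in more than
# `threshold` (a fraction between 0.0 and 1.0) of the sample documents.
def frequent_words(documents, threshold = 0.9)
  doc_counts = Hash.new(0)
  documents.each do |text|
    # Count each word once per document (document frequency, not total hits).
    # The tokenizer is a naive assumption: lowercase ASCII letters and apostrophes.
    text.downcase.scan(/[a-z']+/).uniq.each { |word| doc_counts[word] += 1 }
  end
  cutoff = documents.size * threshold
  doc_counts.select { |_word, count| count > cutoff }.keys.to_set
end

# Hypothetical helper: the words of one document with the ignore words removed.
def keywords(text, ignore_words)
  text.downcase.scan(/[a-z']+/).reject { |word| ignore_words.include?(word) }
end

sample = [
  "The cat sat on the mat",
  "The dog chased the cat",
  "A bird sang in the tree"
]
ignore = frequent_words(sample, 0.9)
p keywords("The cat and the dog", ignore)
```

With a sample this small only the most pervasive words cross the threshold, so a real sample would need to be much larger and typical of the content being indexed.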
Cheers,
Dave