[lnkForumImage]
TotalShareware - Download Free Software

Confronta i prezzi di migliaia di prodotti.
Asp Forum
 Home | Login | Register | Search 


 

Forums >

comp.lang.ruby

How to build an index of phrases in a phrase/sentence?

Dan Fitzpatrick

5/27/2005 6:40:00 PM

I am trying to build an indexing structure on some phrases. Most phrases
will have 2 - 5 parts (words). The resulting array will be dumped into
an index to find the matching phrases. I don't want to do wildcard
searching on the resulting array to find the phrase.

I would like to turn "This is some text" into

["This",
"This is",
"This is some",
"This is some text",
"is",
"is some",
"is some text",
"some",
"some text",
"text"]

The order of the resulting array doesn't matter. When someone searches
for "is some" or "some text", I want it to find this phrase. I don't
want a search for "is text" to find this phrase though.

My solution so far can find all but the middle elements. In this case,
"is some". But when the original phrase has more parts, then more middle
parts are not added to the array.

text = "This is some text"
#=> "This is some text"
ws = ''; text.split(/\W/).collect{|w| ws = (ws+' '+w).strip; ws}
#=> ["This", "This is", "This is some", "This is some text"]
ws = ''; text.split(/\W/).reverse.collect{|w| ws = (w+' '+ws).strip; ws}
#=> ["text", "some text", "is some text", "This is some text"]
text.split(/\W/).collect{|w| w}
=> ["This", "is", "some", "text"]

Is there an better Ruby way to do this? Or is there a better data
structure for retrieving a word or an exact phrase within a
phrase/sentence without wild-carding the search.

Thanks,

Dan


10 Answers

James Gray

5/27/2005 6:59:00 PM

0

On May 27, 2005, at 1:39 PM, Dan Fitzpatrick wrote:

> I am trying to build an indexing structure on some phrases. Most
> phrases will have 2 - 5 parts (words). The resulting array will be
> dumped into an index to find the matching phrases. I don't want to
> do wildcard searching on the resulting array to find the phrase.
>
> I would like to turn "This is some text" into
>
> ["This",
> "This is",
> "This is some",
> "This is some text",
> "is",
> "is some",
> "is some text",
> "some",
> "some text",
> "text"]
>
> The order of the resulting array doesn't matter. When someone
> searches for "is some" or "some text", I want it to find this
> phrase. I don't want a search for "is text" to find this phrase
> though.

Perhaps I'm not understanding, but how does the following fail to
meet your needs?

"This is some text".match(/is some/i)

James Edward Gray II



Ara.T.Howard

5/27/2005 7:19:00 PM

0

Dan Fitzpatrick

5/27/2005 7:20:00 PM

0

I have a few million phrases to index so I don't want to loop through
every phrase and test it with a regular expression.

Below is my working solution. Sometimes just describing the problem
helps the brain to take a new approach:

text = "This is some text"
parts = text.split(/\W/)
phrases = []
parts.each_index do |s|
s.upto(parts.size - 1){|e| phrases.push(parts.slice(s..e))}
end
phrases.each{|p| puts p.join(' ')}

Dan



James Edward Gray II wrote:
> On May 27, 2005, at 1:39 PM, Dan Fitzpatrick wrote:
>
>> I am trying to build an indexing structure on some phrases. Most
>> phrases will have 2 - 5 parts (words). The resulting array will be
>> dumped into an index to find the matching phrases. I don't want to do
>> wildcard searching on the resulting array to find the phrase.
>>
>> I would like to turn "This is some text" into
>>
>> ["This",
>> "This is",
>> "This is some",
>> "This is some text",
>> "is",
>> "is some",
>> "is some text",
>> "some",
>> "some text",
>> "text"]
>>
>> The order of the resulting array doesn't matter. When someone
>> searches for "is some" or "some text", I want it to find this phrase.
>> I don't want a search for "is text" to find this phrase though.
>
>
> Perhaps I'm not understanding, but how does the following fail to meet
> your needs?
>
> "This is some text".match(/is some/i)
>
> James Edward Gray II
>
>


Ara.T.Howard

5/27/2005 7:42:00 PM

0

luc

5/27/2005 7:51:00 PM

0

Full Inverted Index maybe?

Dan Fitzpatrick wrote:
>
> I have a few million phrases to index so I don't want to loop through
> every phrase and test it with a regular expression.
>
> Below is my working solution. Sometimes just describing the problem
> helps the brain to take a new approach:
>
> text = "This is some text"
> parts = text.split(/\W/)
> phrases = []
> parts.each_index do |s|
> s.upto(parts.size - 1){|e| phrases.push(parts.slice(s..e))}
> end
> phrases.each{|p| puts p.join(' ')}
>
> Dan
>
>
>
> James Edward Gray II wrote:
>
>> On May 27, 2005, at 1:39 PM, Dan Fitzpatrick wrote:
>>
>>> I am trying to build an indexing structure on some phrases. Most
>>> phrases will have 2 - 5 parts (words). The resulting array will be
>>> dumped into an index to find the matching phrases. I don't want to
>>> do wildcard searching on the resulting array to find the phrase.
>>>
>>> I would like to turn "This is some text" into
>>>
>>> ["This",
>>> "This is",
>>> "This is some",
>>> "This is some text",
>>> "is",
>>> "is some",
>>> "is some text",
>>> "some",
>>> "some text",
>>> "text"]
>>>
>>> The order of the resulting array doesn't matter. When someone
>>> searches for "is some" or "some text", I want it to find this
>>> phrase. I don't want a search for "is text" to find this phrase though.
>>
>>
>>
>> Perhaps I'm not understanding, but how does the following fail to
>> meet your needs?
>>
>> "This is some text".match(/is some/i)
>>
>> James Edward Gray II
>>
>>
>
>

Gene Tani

5/27/2005 9:38:00 PM

0

Besides glimpse, there's a bunch of indexers in RAA and elsewhere,
Gonzui, RSI, Odeum, cli:

http://rsi.rubyforge.org/files/lib/rsi/rsi_int...
http://gonzui.sourceforge.net/ind...
http://clabs.or...
http://www.zedshaw.com/projects/r...

mayb also look at the python indexers and Bayesian classifiers at
www.divmod.org
You didn't talk about stemming, discarding particles like "a",
"the",etc, issues like that but this should get you started

Luca Pireddu

5/28/2005 3:06:00 AM

0

Dan Fitzpatrick wrote:

> I am trying to build an indexing structure on some phrases. Most phrases
> will have 2 - 5 parts (words). The resulting array will be dumped into
> an index to find the matching phrases. I don't want to do wildcard
> searching on the resulting array to find the phrase.
>
> I would like to turn "This is some text" into
>
> ["This",
> "This is",
> "This is some",
> "This is some text",
> "is",
> "is some",
> "is some text",
> "some",
> "some text",
> "text"]
>
> The order of the resulting array doesn't matter. When someone searches
> for "is some" or "some text", I want it to find this phrase. I don't
> want a search for "is text" to find this phrase though.
>
> My solution so far can find all but the middle elements. In this case,
> "is some". But when the original phrase has more parts, then more middle
> parts are not added to the array.
>
> text = "This is some text"
> #=> "This is some text"
> ws = ''; text.split(/\W/).collect{|w| ws = (ws+' '+w).strip; ws}
> #=> ["This", "This is", "This is some", "This is some text"]
> ws = ''; text.split(/\W/).reverse.collect{|w| ws = (w+' '+ws).strip; ws}
> #=> ["text", "some text", "is some text", "This is some text"]
> text.split(/\W/).collect{|w| w}
> => ["This", "is", "some", "text"]
>
> Is there an better Ruby way to do this? Or is there a better data
> structure for retrieving a word or an exact phrase within a
> phrase/sentence without wild-carding the search.
>
> Thanks,
>
> Dan

I think what you want is a suffix tree.

http://en.wikipedia.org/wiki/S...
http://www.google.com/search?q=suffix+tree&ie=UTF-8&am...

Luca

Gavin Kistner

5/28/2005 3:53:00 AM

0

Oops, I sent that last reply without performance results.

I loaded in a file with 496 words and calculated all the sub-phrases,
and measured the time needed to 'parse' the information, and how much
memory was used. I then timed how long it took to match a sub-phrase
in the middle of the file, and also to 'find' a phrase that didn't
exist.

user system total real
create array: 12.910000 0.950000 13.860000 ( 15.403606)
run 100: 15.570000 0.220000 15.790000 ( 17.475724)
array - 158MB of VM

user system total real
create set: 16.040000 1.100000 17.140000 ( 21.742738)
run 100k: 0.910000 0.000000 0.910000 ( 1.088728)
set - 159MB of VM

user system total real
create matcher: 85.430000 1.340000 86.770000 ( 96.524512)
run 100k: 10.050000 0.160000 10.210000 ( 11.245722)
matcher - 68MB of VM

Note that the array was measuring only 100 iterations of the 2-phrase
match, while the other two performed 100 *thousand* iterations.


The array technique is thus over 10,000 slower to find a match than
the technique using the Set. and about 1,000 slower than the Trie
version, but does setup the data structure more quickly than either.


The Trie method's memory use should also increase at a slower rate
than the others as the source string increases, but I don't know how
to use Ruby to measure VM of a process, so I can't make a simple
automated test to graph this.


--
"Despite the surge of power you feel upon learning Ruby,
resist the urge to trip others or slap them in the bald head.
DO NOT LORD YOUR RUBYNESS OVER OTHERS!"
- Why the Lucky Stiff

Gavin Kistner

5/28/2005 3:53:00 AM

0

One last followup (sorry, I'm bored onboard a plane) :)

I did one manual test of RAM comparing the VM used by the Set storage
versus the Trie storage, comparing the previously-measured 496 word
document with a document that had 1007 words. The results were as I
expected:

469 words:
create set: 16.040000 1.100000 17.140000 ( 21.742738)
159MB of VM

create matcher: 85.430000 1.340000 86.770000 ( 96.524512)
68MB of VM


1007 words:
create set: 137.470000 9.400000 146.870000 (166.828737)
~1GB of VM

create matcher: 746.690000 11.050000 757.740000 (806.450292)
149MB of VM

Conclusion: if you have the RAM to spare, the Set-based approach is
quite speedy, but it gets greedy as your full phrase base grows. If
you need to save some memory and can spare the time, go with the Trie
based approach.



Now, having done all this work...if all you want is sub-phrase
matching, why not use a regexp?


469 words:
user system total real
create clean string: 0.010000 0.010000 0.020000 ( 0.003050)
run 100k matches: 10.750000 0.140000 10.890000 ( 15.839430)
28MB of VM

1007 words:
user system total real
create clean string: 0.010000 0.010000 0.020000 ( 0.432572)
run 100k matches: 19.350000 0.200000 19.550000 ( 27.612700)
28MB of VM



[Slim:~/Desktop/Match Phrases] gavinkis% cat regexp.rb
require 'benchmark'

cleaned = nil
matcher = Regexp.new( "\\b#{ARGV[1]}\\b" )

Benchmark.bm( 20 ){ |x|
x.report( "create clean string:" ){
cleaned = IO.read( ARGV[0] ).downcase.scan( /[a-z']
+/ ).join( ' ' )
}
x.report( "run 100k matches:"){
100_000.times{
cleaned =~ matcher
cleaned =~ /the brown fox/
}
}
}



Gavin Kistner

5/28/2005 3:53:00 AM

0

On May 27, 2005, at 12:39 PM, Dan Fitzpatrick wrote:
> I would like to turn "This is some text" into
>
> ["This",
> "This is",
> "This is some",
> "This is some text",
> "is",
> "is some",
> "is some text",
> "some",
> "some text",
> "text"]

Direct Answer
-------------------------------------------------
def phrases( string )
pieces = string.split( /\s/ )
out = []
pieces.each_index{ |start_index|
(start_index+1).upto( pieces.length ){ |end_index|
out << pieces[start_index...end_index].join(' ')
}
}
out
end

all = phrases( "It's the end of the world as we know it." )
p all
#=> ["It's", "It's the", "It's the end", "It's the end of", "It's the
end of the", "It's the end of the world", "It's the end of the world
as", "It's the end of the world as we", "It's the end of the world as
we know", "It's the end of the world as we know it.", "the", "the
end", "the end of", "the end of the", "the end of the world", "the
end of the world as", "the end of the world as we", "the end of the
world as we know", "the end of the world as we know it.", "end", "end
of", "end of the", "end of the world", "end of the world as", "end of
the world as we", "end of the world as we know", "end of the world as
we know it.", "of", "of the", "of the world", "of the world as", "of
the world as we", "of the world as we know", "of the world as we know
it.", "the", "the world", "the world as", "the world as we", "the
world as we know", "the world as we know it.", "world", "world as",
"world as we", "world as we know", "world as we know it.", "as", "as
we", "as we know", "as we know it.", "we", "we know", "we know it.",
"know", "know it.", "it."]

p all.include?( "end of the world" )
#=> true

p all.include?( "end the world" )
#=> false



Better/Faster
-------------------------------------------------
def phrase_matches( string )
require 'set'
pieces = string.split( /\s/ )
out = Set.new
pieces.each_index{ |start_index|
(start_index+1).upto( pieces.length ){ |end_index|
out.add( pieces[start_index...end_index].join(' ') )
}
}
out
end

all = phrase_matches( "It's the end of the world as we know it." )
p all
p all.include?( "end of the world" )
p all.include?( "end the world" )



Complex-But-Memory-Efficient Answer
-------------------------------------------------
class TrieNode
attr_accessor :children

def initialize
@children = {}
end

def add_path( array )
node = self
array.each{ |item| node = node.children[ item ] ||=
TrieNode.new }
end

def includes_path?( array )
node = self
array.each{ |item| return false unless node = node.children
[ item ] }
true
end

def to_hier( depth=0 )
tabs = "-"*depth
out = ''
@children.each{ |char,node|
out << "#{tabs}#{char}\n"
out << node.to_hier( depth+1 )
}
out
end
end

class PhraseMatcher
MATCH_WORDS = /[a-z']+/

def initialize( string )
@root = TrieNode.new
pieces = string.downcase.scan( MATCH_WORDS )
pieces.each_index{ |start_index|
(start_index+1).upto( pieces.length ){ |end_index|
@root.add_path( pieces[start_index...end_index] )
}
}
end

def includes_phrase?( string )
@root.includes_path?( string.scan( MATCH_WORDS) )
end
end

sub_phrases = PhraseMatcher.new( "It's the end of the world as we
know it, and I feel fine." )

p sub_phrases.includes_phrase?( "end of the world" )
#=> true

p sub_phrases.includes_phrase?( "end the world" )
#=> false

sub_phrases.instance_eval{ puts @root.to_hier }
#=> it
#=> -and
#=> --i
#=> ---feel
#=> ----fine
#=> world
#=> -as
#=> --we
#=> ---know
#=> ----it
#=> -----and
#=> ------i
#=> -------feel
#=> --------fine
#=> and
#=> -i
#=> --feel
#=> ---fine
#=> of
#=> -the
#=> --world
#=> ---as
#=> ----we
#=> -----know
#=> ------it
#=> -------and
#=> --------i
#=> ---------feel
#=> ----------fine
#=> it's
#=> -the
#=> --end
#=> ---of
#=> ----the
#=> -----world
#=> ------as
#=> -------we
#=> --------know
#=> ---------it
#=> ----------and
#=> -----------i
#=> ------------feel
#=> -------------fine
#=> we
#=> -know
#=> --it
#=> ---and
#=> ----i
#=> -----feel
#=> ------fine
#=> know
#=> -it
#=> --and
#=> ---i
#=> ----feel
#=> -----fine
#=> end
#=> -of
#=> --the
#=> ---world
#=> ----as
#=> -----we
#=> ------know
#=> -------it
#=> --------and
#=> ---------i
#=> ----------feel
#=> -----------fine
#=> the
#=> -world
#=> --as
#=> ---we
#=> ----know
#=> -----it
#=> ------and
#=> -------i
#=> --------feel
#=> ---------fine
#=> -end
#=> --of
#=> ---the
#=> ----world
#=> -----as
#=> ------we
#=> -------know
#=> --------it
#=> ---------and
#=> ----------i
#=> -----------feel
#=> ------------fine
#=> as
#=> -we
#=> --know
#=> ---it
#=> ----and
#=> -----i
#=> ------feel
#=> -------fine
#=> fine
#=> i
#=> -feel
#=> --fine
#=> feel
#=> -fine