Asp Forum - Histogram-type Data

Charles L. Snyder

6/21/2005 10:50:00 PM

Hi,
I have a text-parsing question -

array1 = ["Joe lives in France", "Sue goes to Japan", "Hiro walked around
France", "Bill moved to England"]
a_countries = ["France", "Japan", "England", "Spain"]

desired result is unique list (hash table) of all countries mentioned in
the sentences in array1, with frequency of occurence - eg,

France 2
Japan 1
England 1
Spain 0

neither of these (below) works:

a_countries.each do |h|
x = array1.inject([]){|menu,e| menu<< h if country_list.include?(e);menu}
end

x = array1.inject([]){|menu,e| menu<< $& if /#{a_countries.each}/ =~ e;
menu}

6 Answers

James Gray

6/22/2005 12:56:00 AM

On Jun 21, 2005, at 5:50 PM, Charles L. Snyder wrote:

> Hi,
> I have a text-parsing question -
>
> array1 = ["Joe lives in France", "Sue goes to Japan", "Hiro walked
> around
> France", "Bill moved to England"]
> a_countries = ["France", "Japan", "England", "Spain"]
>
> desired result is unique list (hash table) of all countries
> mentioned in
> the sentences in array1, with frequency of occurence - eg,
>
> France 2
> Japan 1
> England 1
> Spain 0

See if this gives you some fresh ideas:

irb(main):001:0> array1 = ["Joe lives in France", "Sue goes to
Japan", "Hiro walked around France", "Bill moved to England"]
=> ["Joe lives in France", "Sue goes to Japan", "Hiro walked around
France", "Bill moved to England"]
irb(main):002:0> a_countries = ["France", "Japan", "England", "Spain"]
=> ["France", "Japan", "England", "Spain"]
irb(main):003:0> counts = Hash.new
=> {}
irb(main):004:0> a_countries.each { |c| counts[c] = array1.find_all
{ |line| line =~ /\b#{c}\b/ }.size }
=> ["France", "Japan", "England", "Spain"]
irb(main):005:0> counts
=> {"Spain"=>0, "France"=>2, "Japan"=>1, "England"=>1}

Hope that helps.

James Edward Gray II

Florian Frank

6/22/2005 2:50:00 AM

Charles L. Snyder wrote:

> I have a text-parsing question -
>
>array1 = ["Joe lives in France", "Sue goes to Japan", "Hiro walked around
>France", "Bill moved to England"]
>a_countries = ["France", "Japan", "England", "Spain"]
>
>
puts a_countries.inject({}) {|h,c| h[c] = array1.inject(0) { |s,a| s +
a.scan(c).size }; h }.sort_by { |_,v| -v }.map { |a| "%10s%10u" % a }

--
Florian Frank

Robert Klemme

6/22/2005 7:01:00 AM

"Florian Frank" <flori@nixe.ping.de> schrieb im Newsbeitrag
news:42B8D1A3.4040005@nixe.ping.de...
> Charles L. Snyder wrote:
>
>> I have a text-parsing question -
>>
>>array1 = ["Joe lives in France", "Sue goes to Japan", "Hiro walked around
>>France", "Bill moved to England"]
>>a_countries = ["France", "Japan", "England", "Spain"]
>>
> puts a_countries.inject({}) {|h,c| h[c] = array1.inject(0) { |s,a| s +
> a.scan(c).size }; h }.sort_by { |_,v| -v }.map { |a| "%10s%10u" % a }

If I'm not mistaken your variant iterates array1 a_countries.size number of
times. For large arrays this might be quite inefficient. This is what I'd
do:

counts = a_countries.inject({}){|h,c| h[c]=0; h}
array1.each do |text|
text.scan(/\w+/i) {|m| c = counts[m] and counts[m] = c+1}
end

You could even optimize away one hash lookups per found word by doing:

class Counter
def initialize(n=0) @n=n end
def inc() @n+=1 end
def to_int() @n end
end

counts = a_countries.inject({}){|h,c| h[c]=Counter.new; h}
array1.each do |text|
text.scan(/\w+/i) {|m| c = counts[m] and c.inc}
end

Kind regards

robert

Florian Frank

6/22/2005 7:49:00 AM

Robert Klemme wrote:

>
> "Florian Frank" <flori@nixe.ping.de> schrieb im Newsbeitrag
> news:42B8D1A3.4040005@nixe.ping.de...
>
>> Charles L. Snyder wrote:
>>
>>> I have a text-parsing question -
>>>
>>> array1 = ["Joe lives in France", "Sue goes to Japan", "Hiro walked
>>> around
>>> France", "Bill moved to England"]
>>> a_countries = ["France", "Japan", "England", "Spain"]
>>>
>> puts a_countries.inject({}) {|h,c| h[c] = array1.inject(0) { |s,a| s
>> + a.scan(c).size }; h }.sort_by { |_,v| -v }.map { |a| "%10s%10u" % a }
>
>
> If I'm not mistaken your variant iterates array1 a_countries.size
> number of times. For large arrays this might be quite inefficient.
> This is what I'd do:
>
> counts = a_countries.inject({}){|h,c| h[c]=0; h}
> array1.each do |text|
> text.scan(/\w+/i) {|m| c = counts[m] and counts[m] = c+1}
> end

But, but,... you used more than one line!? Isn't this against the rules?

> You could even optimize away one hash lookups per found word by doing:
>
> class Counter
> def initialize(n=0) @n=n end
> def inc() @n+=1 end
> def to_int() @n end
> end

Without benchmarking (yeah) it's difficult to say, if this is really faster.

> counts = a_countries.inject({}){|h,c| h[c]=Counter.new; h}
> array1.each do |text|
> text.scan(/\w+/i) {|m| c = counts[m] and c.inc}
> end

But I am sure calling the block for every WORD of a text and having a
hash lookup for each word, is quite inefficient for larger texts. What
about:

array1.inject(Hash.new(0)){|c,t|$re||=/\b(#{a_countries.join('|')})\b/;t.scan($re){|x|c[x]+=1};c}

Note especially, the one line. ;)

--
Florian Frank

Robert Klemme

6/22/2005 9:17:00 AM

Florian Frank wrote:
> Robert Klemme wrote:
>
>>
>> "Florian Frank" <flori@nixe.ping.de> schrieb im Newsbeitrag
>> news:42B8D1A3.4040005@nixe.ping.de...
>>
>>> Charles L. Snyder wrote:
>>>
>>>> I have a text-parsing question -
>>>>
>>>> array1 = ["Joe lives in France", "Sue goes to Japan", "Hiro walked
>>>> around
>>>> France", "Bill moved to England"]
>>>> a_countries = ["France", "Japan", "England", "Spain"]
>>>>
>>> puts a_countries.inject({}) {|h,c| h[c] = array1.inject(0) { |s,a| s
>>> + a.scan(c).size }; h }.sort_by { |_,v| -v }.map { |a| "%10s%10u" %
>>> a }
>>
>>
>> If I'm not mistaken your variant iterates array1 a_countries.size
>> number of times. For large arrays this might be quite inefficient.
>> This is what I'd do:
>>
>> counts = a_countries.inject({}){|h,c| h[c]=0; h}
>> array1.each do |text|
>> text.scan(/\w+/i) {|m| c = counts[m] and counts[m] = c+1}
>> end
>
> But, but,... you used more than one line!? Isn't this against the
> rules?

Well... I can squeeze it on one line. :-)

>> You could even optimize away one hash lookups per found word by
>> doing:
>>
>> class Counter
>> def initialize(n=0) @n=n end
>> def inc() @n+=1 end
>> def to_int() @n end
>> end
>
> Without benchmarking (yeah) it's difficult to say, if this is really
> faster.
>
>> counts = a_countries.inject({}){|h,c| h[c]=Counter.new; h}
>> array1.each do |text|
>> text.scan(/\w+/i) {|m| c = counts[m] and c.inc}
>> end
>
> But I am sure calling the block for every WORD of a text and having a
> hash lookup for each word, is quite inefficient for larger texts. What
> about:
>
>
array1.inject(Hash.new(0)){|c,t|$re||=/\b(#{a_countries.join('|')})\b/;t.s
can($re){|x|c[x]+=1};c}
>
> Note especially, the one line. ;)

Yes, I've thought of that, too. In fact, you can create a more optimal
regexp. Here's a simplified variant of what I usually use (the more
verbose version optimizes away unnecessary regexp groupings):

class Trie
attr_reader :root

def initialize()
create = lambda {|h,k| h[k]=Hash.new(&create)}
@root = Hash.new(&create)
end

def add(str)
node = root
str.each_byte do |char|
node = node[char]
end
self
end

def to_rx() Regexp.new(to_rx_str) end

def to_rx_str(node = root)
node.map {|k,v| "" << k << "(?:" << to_rx_str(v) << ")" }.join "|"
end
end

>> array1 = ["Joe lives in France", "Sue goes to Japan", "Hiro walked
around France", "Bill moved to England"]
=> ["Joe lives in France", "Sue goes to Japan", "Hiro walked around
France", "Bill moved to England"]
>> a_countries = ["France", "Japan", "England", "Spain"]
=> ["France", "Japan", "England", "Spain"]
>> rx = a_countries.inject(Trie.new){|tr,c|tr.add c}.to_rx
=>
/S(?:p(?:a(?:i(?:n(?:)))))|J(?:a(?:p(?:a(?:n(?:)))))|E(?:n(?:g(?:l(?:a(?:n
(?:d(?:)))))))|F(?:r(?:a(?:n(?:c(?:e(?:))))))/
>> counts = array1.inject(Hash.new(0)){|h,text| text.scan(rx){|m|
h[m]+=1}; h}
=> {"France"=>2, "Japan"=>1, "England"=>1}

:-)

Kind regards

robert

Robert Klemme

6/25/2005 3:49:00 PM

Robert Klemme <bob.news@gmx.net> wrote:
> Florian Frank wrote:
>> Robert Klemme wrote:
>>
>>>
>>> "Florian Frank" <flori@nixe.ping.de> schrieb im Newsbeitrag
>>> news:42B8D1A3.4040005@nixe.ping.de...
>>>
>>>> Charles L. Snyder wrote:
>>>>
>>>>> I have a text-parsing question -
>>>>>
>>>>> array1 = ["Joe lives in France", "Sue goes to Japan", "Hiro walked
>>>>> around
>>>>> France", "Bill moved to England"]
>>>>> a_countries = ["France", "Japan", "England", "Spain"]
>>>>>
>>>> puts a_countries.inject({}) {|h,c| h[c] = array1.inject(0) { |s,a|
>>>> s + a.scan(c).size }; h }.sort_by { |_,v| -v }.map { |a|
>>>> "%10s%10u" % a }
>>>
>>>
>>> If I'm not mistaken your variant iterates array1 a_countries.size
>>> number of times. For large arrays this might be quite inefficient.
>>> This is what I'd do:
>>>
>>> counts = a_countries.inject({}){|h,c| h[c]=0; h}
>>> array1.each do |text|
>>> text.scan(/\w+/i) {|m| c = counts[m] and counts[m] = c+1}
>>> end
>>
>> But, but,... you used more than one line!? Isn't this against the
>> rules?
>
> Well... I can squeeze it on one line. :-)
>
>>> You could even optimize away one hash lookups per found word by
>>> doing:
>>>
>>> class Counter
>>> def initialize(n=0) @n=n end
>>> def inc() @n+=1 end
>>> def to_int() @n end
>>> end
>>
>> Without benchmarking (yeah) it's difficult to say, if this is really
>> faster.
>>
>>> counts = a_countries.inject({}){|h,c| h[c]=Counter.new; h}
>>> array1.each do |text|
>>> text.scan(/\w+/i) {|m| c = counts[m] and c.inc}
>>> end
>>
>> But I am sure calling the block for every WORD of a text and having a
>> hash lookup for each word, is quite inefficient for larger texts.
>> What about:
>>
>>
> array1.inject(Hash.new(0)){|c,t|$re||=/\b(#{a_countries.join('|')})\b/;t.s
> can($re){|x|c[x]+=1};c}
>>
>> Note especially, the one line. ;)
>
> Yes, I've thought of that, too. In fact, you can create a more
> optimal regexp. Here's a simplified variant of what I usually use
> (the more verbose version optimizes away unnecessary regexp
> groupings):
>
> class Trie
> attr_reader :root
>
> def initialize()
> create = lambda {|h,k| h[k]=Hash.new(&create)}
> @root = Hash.new(&create)
> end
>
> def add(str)
> node = root
> str.each_byte do |char|
> node = node[char]
> end
> self
> end
>
> def to_rx() Regexp.new(to_rx_str) end
>
> def to_rx_str(node = root)
> node.map {|k,v| "" << k << "(?:" << to_rx_str(v) << ")" }.join "|"
> end
> end
>
>>> array1 = ["Joe lives in France", "Sue goes to Japan", "Hiro walked
> around France", "Bill moved to England"]
> => ["Joe lives in France", "Sue goes to Japan", "Hiro walked around
> France", "Bill moved to England"]
>>> a_countries = ["France", "Japan", "England", "Spain"]
> => ["France", "Japan", "England", "Spain"]
>>> rx = a_countries.inject(Trie.new){|tr,c|tr.add c}.to_rx
> =>
> /S(?:p(?:a(?:i(?:n(?:)))))|J(?:a(?:p(?:a(?:n(?:)))))|E(?:n(?:g(?:l(?:a(?:n
> (?:d(?:)))))))|F(?:r(?:a(?:n(?:c(?:e(?:))))))/
>>> counts = array1.inject(Hash.new(0)){|h,text| text.scan(rx){|m|
> h[m]+=1}; h}
> => {"France"=>2, "Japan"=>1, "England"=>1}
>
> :-)
>
> Kind regards
>
> robert

Here's a more optimized version

class Trie
attr_reader :root

def initialize()
create = lambda {|h,k| h[k]=Hash.new(&create)}
@root = Hash.new(&create)
end

def add(str)
node = root
str.each_byte do |char|
node = node[char]
end
self
end

def to_rx(*rx_options) Regexp.new(to_rx_str, *rx_options) end

def to_rx_str(node = root)
case node.size
when 0
""
when 1
"" << node.keys[0] << to_rx_str(node.values[0])
else
sub_rx = node.map {|k,v| "" << k << to_rx_str(v) }.join( "|" )
if node.equal? root then sub_rx else "(?:" << sub_rx << ")" end
end
end
end

?> array1 = ["Joe lives in France", "Sue goes to Japan", "Hiro walked around
France", "Bill moved to England"]
=> ["Joe lives in France", "Sue goes to Japan", "Hiro walked around France",
"Bill moved to England"]
>> a_countries = ["France", "Japan", "England", "Spain", "Ecuador"]
=> ["France", "Japan", "England", "Spain", "Ecuador"]
>> rx = a_countries.inject(Trie.new){|tr,c|tr.add c}.to_rx
=> /Spain|Japan|E(?:cuador|ngland)|France/

Cheers

robert

comp.lang.ruby

Histogram-type Data

Charles L. Snyder

James Gray

Florian Frank

Robert Klemme

Florian Frank

Robert Klemme

Robert Klemme

x Login to ForumsZone