Robert Klemme
6/12/2007 11:01:00 AM
On 11.06.2007 22:20, Erwin Abbott wrote:
> On 6/10/07, Robert Klemme <shortcutter@googlemail.com> wrote:
>> On 10.06.2007 18:25, Erwin Abbott wrote:
>> > I ended up refactoring, but earlier I was parsing some text by
>> > associating an array of attributes (like, [/a/, /b/, /c$/] that might
>> > match the first 3 words) with a block that processed the matching
>> > text, and then moved the position in the string forward by 3 words. I
>> > wanted to be able to do this like:
>>
>> Did I understand that properly, you want to process three words at a
>> time and then the next three words? Then you could do...
>
> I already have solved the problem, but maybe someone will find this
> useful in the future. Basically we start with position=0 (position is
> the index of the words array). Each "match" is tried until one
> succeeds, and the position is incremented by the number of words it
> operated on. So if I called match.call(/the/, /quick/, /brown/, /.*/,
> /e$/), it would read 5 words starting at "position" and if all the
> arguments matched the words, it would process the 5 words in some way
> and then increment the position by 5.
>
> In my application I'm not really using regexes though; my words are
> tokens with various tags, and I'm matching based on the tags. This is
> all being used to parse date strings: "Wed Aug 5th 2008" might be
> matched by a rule like match.call(:weekday, :month, :ordinal, :year),
> for example. Then there might be another rule like match.call(:num,
> :num, :year) that would match "05 05 2005" and would decide how to
> parse it.
>
>> It's still unclear to me how exactly you want the matching to work. Are
>> all your "attributes" matched against all three words? Do you want
>> positional matches? In the code all rx's are matched against words in
>> the same position and if all match the block is invoked on the words.
>
> Basically you have it right, the words have to match their
> /respective/ attribute. But it's not a fixed number of words at a
> time, because match.call(/the/) would only match one word (then
> process it, then increment the position index by one).
>
> Initially (it was late at night, mind you) I thought having a closure
> would work nicely because I could access position, words, and some
> other variables in the caller's scope and wouldn't have to pass those
> along every time. But it was too tricky/messy because I also needed to
> restart at the beginning of the loop after a success (to start trying
> all the patterns again), and I needed to know if anything had matched
> (so I could increment position by 1, else have an infinite loop).
>
> What I ended up doing was having a function to store the list of
> attributes and the block that should be called to "process" the
> matching words, and then another function that began scanning the word
> list from position=0, testing all the attributes (like match.call
> would've), and taking care of incrementing the position index the
> right amount. Here's parts of the code:
>
> def self.date_scanner *tags, &block
>   @@date_scanners << [tags, block]
> end
>
> def self.setup_date_scanners
>   @@date_scanners = []
>
>   date_scanner(NLTime::Day, :time, :tz) do |d, t|
>     # two timezones were given, like 12:30:00 -0400 (EDT); ignore rightmost one
>     d.get_tag(NLTime::Day).time(t.get_tag(NLTime::Time))
>   end
>
>   date_scanner(NLTime::Day, :time) do |d, t|
>     d.get_tag(NLTime::Day).time(t.get_tag(NLTime::Time))
>   end
>
>   date_scanner(:time, NLTime::Day) do |t, d|
>     d.get_tag(NLTime::Day).time(t.get_tag(NLTime::Time))
>   end
>
>   date_scanner(:month, :num, :time, :year) do |m, a, t, y|
>     # May 05 12:00:00 -0000 2005
>     day = NLTime::Day.civil(a.word, m.word, y.get_tag(NLTime::Year))
>     day.time(t.get_tag(NLTime::Time))
>   end
>
>   date_scanner(:year, :num, :num) do |y, a, b|
>     # 2005 05 05
>     NLTime::Day.civil(b.word, a.word, y.get_tag(NLTime::Year))
>   end
>
>   date_scanner(:year, :month, :num) do |y, m, a|
>     # 2005 May 05
>     NLTime::Day.civil(a.word, m.word, y.get_tag(NLTime::Year))
>   end
>
>   date_scanner(:month, :num, :year) do |m, a, y|
>     # May 05 2005
>     NLTime::Day.civil(a.word, m.word, y.get_tag(NLTime::Year))
>   end
>
>   # ...
> end
>
> def self.scan_dates tokens, order=:dm
>   # TODO:
>   #   order=:dm assume day/month like European format
>   #   order=:md assume month/day like American format
>
>   # processed tokens
>   ptokens = []; k = 0
>
>   while k < tokens.size
>     found = false
>
>     @@date_scanners.each do |tags, block|
>       if s = tokens[-tags.size-k..-1-k]
>         # assume success until one of the tags doesn't match
>         found = true
>
>         # match tags to tokens
>         s.zip(tags).each do |token, tag|
>           unless token.has_tag? tag
>             # not a match... next scanner, please
>             found = false
>             break
>           end
>         end
>
>         if found
>           # this scanner matches, have the tokens processed
>           if date = block.call(*s)
>             token = NLTime::Token.new(date.to_s, :entity, date)
>             ptokens.unshift token
>
>             # increment the position by number of tokens processed by the block
>             k += tags.size
>
>             # don't try to match any more scanners
>             break
>           else
>             # the block failed, try the next scanner
>             found = false
>           end
>         end
>       end
>     end
>
>     unless found
>       # none of the scanners matched
>       ptokens.unshift tokens[-1-k]
>       k += 1
>     end
>   end
>
>   ptokens
> end
>
> The scan_dates operates on an array of NLTime::Tokens, which have
> various tags. The tags can be symbols, which basically categorize
> words (like "Jan" would have :month tag), or they can be objects (like
> we might have tagged 2005 with a NLTime::Year object representing the
> year 2005). This should "replace" sequences of tokens that were
> matched by a scanner with a new token, tagged with an instance of
> NLTime::Day or Time.
>
>> I still think you're not yet there.
>
> Well, my code does what I want it to do... so I'm not sure what you mean?

I had the impression that your design or implementation still had room
for improvement. From what you wrote I assume you do some kind of
pattern matching. For the fun of it I coded something that solves a
similar problem. Of course this could be changed in all sorts of ways
(e.g. to store the pattern hash as a variable, or remove the enum, etc.).

Kind regards

robert

#!ruby

require 'enumerator'

class PatternMatcher
  attr_accessor :enum
  attr_reader :pos

  def initialize(enum)
    self.enum = enum
    reset
  end

  def reset
    @pos = 0
  end

  def match(patterns)
    section = enum[@pos, patterns.size]
    if patterns.to_enum(:zip, section).all? {|pt, x| pt === x}
      yield(*section)
      @pos += patterns.size
    end
  end

  def match_all(action_hash)
    while more?
      action_hash.each do |patterns, action|
        match(patterns, &action) and break nil
      end and raise "Cannot continue to match"
    end
  end

  def rest
    enum[@pos .. -1]
  end

  def more?
    @pos < enum.size
  end
end

pm = PatternMatcher.new %w{foo bar baz}

pm.match( [/^ba/] ) {|*a| p a}
pm.match( [/^f/, /ba/] ) {|*a| p a}
pm.match( [/b/] ) {|*a| p a}

puts "--"

pm.reset
pm.match_all(
  [/^b/] => lambda {|x| puts "1 #{x}"},
  [/^f/, /ba/] => lambda {|x,y| puts "2 #{x}, #{y}"},
  [/b/] => lambda {|x| puts "1b #{x}"}
)

puts "--"

begin
  pm.reset
  pm.match_all(
    [/^b/] => lambda {|x| puts "1 #{x}"},
    [/b/] => lambda {|x| puts "1b #{x}"}
  )
rescue Exception => e
  p e
end
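
For the tagged-token case described in the quoted text, the same idea might look roughly like the sketch below. Everything in it is made up for illustration: Token, has_tag?, scan and the rule list are hypothetical names, not NLTime's actual API. It also differs from the quoted scan_dates in two ways: it walks the tokens left to right rather than from the end, and it does not retry the next rule when an action returns nil. When no rule matches, the token's word is passed through and the position advances by one, so the loop always terminates.

```ruby
# Hypothetical token type: a word plus a list of tags.
# Tags may be symbols, or anything else that responds to ===.
Token = Struct.new(:word, :tags) do
  def has_tag?(tag)
    tags.any? { |t| tag === t }
  end
end

# rules is an array of [tags, action] pairs, tried in order.
# The first rule whose tags match the tokens at pos, position by position,
# wins; its action receives the matched tokens.
def scan(tokens, rules)
  out = []
  pos = 0
  while pos < tokens.size
    rule = rules.find do |tags, _action|
      slice = tokens[pos, tags.size]
      slice.size == tags.size &&
        slice.zip(tags).all? { |token, tag| token.has_tag?(tag) }
    end

    if rule
      tags, action = rule
      out << action.call(*tokens[pos, tags.size])
      pos += tags.size   # skip everything the rule consumed
    else
      out << tokens[pos].word
      pos += 1           # nothing matched: pass the word through
    end
  end
  out
end

tokens = [
  Token.new("on",   [:word]),
  Token.new("May",  [:month]),
  Token.new("05",   [:num]),
  Token.new("2005", [:num, :year])
]

rules = [
  [[:month, :num, :year], lambda { |m, d, y| "#{y.word}-#{m.word}-#{d.word}" }]
]

p scan(tokens, rules)   # prints ["on", "2005-May-05"]
```

Keeping the rules in an array of pairs rather than a hash makes the order in which they are tried explicit.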