Asp Forum - regexp help sought

rpardee

2/24/2005 4:46:00 PM

Hey All,

I'm trying to parse lines from my text editor's config file, which look
like this (pls watch for line wrap--there is one line per language,
starting with /L<<digit>>):

/L1"SAS" Line Comment = * Block Comment On = /* Block Comment Off = */ Block Comment On Alt = * Block Comment Off Alt = ; Nocase File Extensions = SAS
/L2"Visual Basic" Line Comment = ' File Extensions = BAS FRM CLS VBS CTL WSF
/L4"HTML" Nocase Noquote HTML_LANG Block Comment On = <!-- Block Comment Off = --> Block Comment On Alt = <% Block Comment Off Alt = %> String Chars = "' File Extensions = HTM HTML ASP SHTML HTT HTX JSP
/L11"Ruby" Line Comment Num = 2# Block Comment On = =begin Block Comment Off = =end String Chars='" Escape Char = \ File Extensions = RB RBW

I'm trying to write a method for extracting the comment markers & their
types (line/block & on/off). Regexps seemed the obvious tool, and I
eventually came up with this one:

c = Regexp.new("(Line|Block) Comment (On |Off |On Alt |Off Alt)*= ([^\s\t\r\n\f]+) ")

This is working well so far, except that it only grabs out the first
type of comment in each line. I'd hoped that I could make it get all
the comment types by putting an additional set of parens and a +
quantifier around the whole expression:

c = Regexp.new("((Line|Block) Comment (On |Off |On Alt |Off Alt)*= ([^\s\t\r\n\f]+))+ ")

But that just seems to break it--that version doesn't capture anything.

Anybody got a clue for me? I'm using Ruby 1.8 on Windows. My code is
below. (And again, pls watch for line wrapping).

Thanks!

-Roy

def parse_comment_markers(line)
=begin

There are line comments & (2 different kinds of) block comments.

Line comments only have a start marker--EOL is the terminator.

Comment types are:
Line Comment = <>
Block Comment On = <>
Block Comment Off = <>
Block Comment On Alt = <>
Block Comment Off Alt = <>

Where <> can be any contiguous set of non-whitespace chars.

For Line comment marks, preceding digits specify the # of spaces minus 1
required after the nondigit portion of the marker. So for ruby, the line
comment mark is 2#, signifying that # is a comment only if it is followed by
a space. Ignore this for now.

So--funky regexp time. We want to grab sequences centered around the
string " Comment ". We want the single word prior to "Comment" and all
words between "Comment" and " = ", and then of course the contiguous
nonwhitespace following " = ".

=end
  puts line
  # Why doesn't the \S char class work?
  # c = Regexp.new("(Line|Block) Comment (On |Off |On Alt |Off Alt)*= (\S+)")
  c = Regexp.new("(Line|Block) Comment (On |Off |On Alt |Off Alt)*= ([^\s\t\r\n\f]+) ")
  cm = c.match(line)
  if cm.nil?
    puts "No match!"
  else
    puts cm.captures.join(" || ")
    puts "Comment type is \"" + cm.captures[0] + "\", and comment marker is \"" + cm.captures[2] + "\""
  end
end

parse_comment_markers("/L2 \"Ruby\" Line Comment = # Block Comment On = ' File Extensions = RB RBW")
parse_comment_markers("/L2 \"Ruby\" Block Comment On = =begin Block Comment Off = =end File Extensions = RB RBW")

2 Answers

Robert Klemme

2/24/2005 4:59:00 PM

<rpardee@comcast.net> wrote in message
news:1109263535.280360.203760@g14g2000cwa.googlegroups.com...
> Hey All,
>
> I'm trying to parse lines from my text editor's config file, which look
> like this (pls watch for line wrap--there is one line per language,
> starting with /L<<digit>>):
>
> /L1"SAS" Line Comment = * Block Comment On = /* Block Comment Off = */
> Block Comment On Alt = * Block Comment Off Alt = ; Nocase File
> Extensions = SAS
> /L2"Visual Basic" Line Comment = ' File Extensions = BAS FRM CLS VBS
> CTL WSF
> /L4"HTML" Nocase Noquote HTML_LANG Block Comment On = <!-- Block Comment Off = --> Block Comment On Alt = <% Block Comment Off Alt = %>
> String Chars = "' File Extensions = HTM HTML ASP SHTML HTT HTX JSP
> /L11"Ruby" Line Comment Num = 2# Block Comment On = =begin Block
> Comment Off = =end String Chars='" Escape Char = \ File Extensions = RB
> RBW
>
> I'm trying to write a method for extracting the comment markers & their
> types (line/block & on/off). Regexps seemed the obvious tool, and I
> eventually came up with this one:
>
> c = Regexp.new("(Line|Block) Comment (On |Off |On Alt |Off Alt)*=
> ([^\s\t\r\n\f]+) ")
>
> This is working well so far, except that it only grabs out the first
> type of comment in each line. I'd hoped that I could make it get all
> the comment types by putting an additional set of parens and a +
> quantifier around the whole expression:
>
> c = Regexp.new("((Line|Block) Comment (On |Off |On Alt |Off Alt)*=
> ([^\s\t\r\n\f]+))+ ")
>
> But that just seems to break it--that version doesn't capture anything.
>
> Anybody got a clue for me? I'm using v1.8 on windows. My code is
> below. (And again, pls watch for line wrapping).

You want String#scan, which returns every match in the line instead of
only the first one:

matches = line.scan(re)

or

line.scan(re) do |match|
....
end
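
A minimal sketch of how that could look for the config lines in the question. The pattern and the names COMMENT_RE and comment_markers are assumptions made for illustration, not something taken from the original code; the pattern is written as a regexp literal and uses non-capturing groups so that each match yields exactly three captures.

```ruby
# Hypothetical pattern (an assumption, not the original poster's regexp).
# Capture 1: Line or Block; capture 2: optional On/Off (plus Alt); capture 3: the marker.
COMMENT_RE = /(Line|Block) Comment (?:((?:On|Off)(?: Alt)?) )?= (\S+)/

def comment_markers(line)
  # When the pattern has capture groups, String#scan returns one array
  # of captures per match, for every match in the string.
  line.scan(COMMENT_RE)
end

line = '/L1"SAS" Line Comment = * Block Comment On = /* Block Comment Off = */ ' \
       'Block Comment On Alt = * Block Comment Off Alt = ; Nocase File Extensions = SAS'

comment_markers(line).each do |kind, switch, marker|
  # switch is nil for line comments, which have no On/Off part.
  puts [kind, switch || "-", marker].join(" || ")
end
```

Each element of the result is a three-element array, so the block form shown above (line.scan(re) do |match| ... end) would receive the same arrays one at a time.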

Kind regards

robert

rpardee@gmail.com

2/25/2005 2:43:00 PM

Awesome--that's exactly what I needed. And much more readable than
elaborating the regexp.

Thanks!

-Roy
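
On the "Why doesn't the \S char class work?" comment in the code, which the thread never picks up: a likely explanation is that the pattern is passed to Regexp.new as a double-quoted string. Ruby processes backslash escapes in double-quoted strings before the regexp engine sees them, so "\S" arrives as a plain "S", while "\s" becomes a literal space (which is why the [^\s\t\r\n\f] workaround behaves as intended). A small sketch, with variable names chosen only for illustration:

```ruby
# Escapes in a double-quoted string are handled by Ruby first,
# so Regexp.new never sees the backslash in "\S".
from_double = Regexp.new("= (\S+)")   # "\S" collapses to a plain "S"
from_single = Regexp.new('= (\S+)')   # single quotes keep the backslash
literal     = /= (\S+)/               # a regexp literal keeps it too

puts from_double.source   # = (S+)
puts from_single.source   # = (\S+)

sample = "Line Comment = # File Extensions = RB RBW"
p from_double.match(sample)     # nil: no run of literal "S" follows "= "
p literal.match(sample)[1]      # "#"
```

Writing the pattern as a regexp literal or a single-quoted string avoids the problem.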

comp.lang.ruby
