Asp Forum - Regexp question

Mark Probert

9/30/2004 9:11:00 PM

Hi, Rubyists.

What is the best way of attacking field split on ';' when the string looks
like:

s = 'a;b;c\;;d;'
s.split(/???;/)
=> ["a", "b", "c\;", "d"]

Or is it best to use s.each_byte and do it by hand?

--
-mark. (probertm @ acm dot org)

10 Answers

Simon Strandgaard

9/30/2004 9:29:00 PM

On Thursday 30 September 2004 23:15, Mark Probert wrote:
> Hi, Rubyists.
>
> What is the best way of attacking field split on ';' when the string looks
> like:
>
> s = 'a;b;c\;;d;'
> s.split(/???;/)
> => ["a", "b", "c\;", "d"]
>
> Or is it best to use s.each_byte and do it by hand?

How about something ala

irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
=> ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]

--
Simon Strandgaard

Brian Schröder

9/30/2004 9:34:00 PM

Mark Probert wrote:
> Hi, Rubyists.
>
> What is the best way of attacking field split on ';' when the string looks
> like:
>
> s = 'a;b;c\;;d;'
> s.split(/???;/)
> => ["a", "b", "c\;", "d"]
>
> Or is it best to use s.each_byte and do it by hand?
>

Normally this would call for fixed width lookbehind,

/(?<!\\);/

but as far as I know it's not included in the Ruby regexp engine.

But for further clarification:
How should 'a;b\\;;c' be split?
If backslashes can be escaped (and you'd want that, because otherwise you
can't have a field "b\"), it's more difficult.
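
For readers on a newer Ruby: lookbehind is available from Ruby 1.9 onwards, so the pattern above can be tried directly. The following is only a small sketch under that assumption (the variable names are made up for illustration), and it also shows why the escaped-backslash case is the harder one:

```ruby
# Assumes Ruby 1.9 or later, where the regexp engine supports lookbehind.
s = 'a;b;c\;;d;'                  # single quotes: c, backslash, semicolon
fields_s = s.split(/(?<!\\);/)
p fields_s                        # => ["a", "b", "c\\;", "d"]

# Caveat: an escaped backslash right before a separator fools the lookbehind.
t = 'a;b\\\\;;c'                  # b followed by two backslashes
fields_t = t.split(/(?<!\\);/)
p fields_t                        # => ["a", "b\\\\;", "c"]
```

In the second case the semicolon after the two backslashes is a real separator, but the lookbehind only sees "preceded by a backslash" and refuses to split there.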

And maybe the CSV library can help you here.

regards,

Brian

--
Brian Schröder
http://ruby.brian-sch...

Simon Strandgaard

9/30/2004 9:43:00 PM

On Thursday 30 September 2004 23:29, Simon Strandgaard wrote:
> On Thursday 30 September 2004 23:15, Mark Probert wrote:
> > Hi, Rubyists.
> >
> > What is the best way of attacking field split on ';' when the string
> > looks like:
> >
> > s = 'a;b;c\;;d;'
> > s.split(/???;/)
> > => ["a", "b", "c\;", "d"]
> >
> > Or is it best to use s.each_byte and do it by hand?
>
> How about something ala
>
> irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
> => ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]

Maybe this one is better?

irb(main):001:0> "aa;bbb\\;;abc;;d\\\\;e;f".scan(/(?:\A|;)((?:\\[^.]|[^;])*)/) { p $1 }
"aa"
"bbb\\;"
"abc"
""
"d\\\\"
"e"
"f"
=> "aa;bbb\\;;abc;;d\\\\;e;f"
irb(main):002:0>

--
Simon Strandgaard

Mark Probert

9/30/2004 9:47:00 PM

Hi ..

Simon Strandgaard <neoneye@adslhome.dk> wrote:
>
> How about something ala
>
> irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
> => ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]
>

Thanks! That is close enough:

irb(main):019:0> s.scan(/(?:\\[^.]|[^;])*/).each do |it|
irb(main):020:1* next if it.empty?
irb(main):021:1> puts " --> #{it}"
irb(main):022:1> end
--> a is a word
--> b is too
--> c\; for fun
--> d -- forget it
=> ["a is a word", "", "b is too", "", "c\\; for fun", "", "d -- forget it", "", ""]

--
-mark. (probertm @ acm dot org)

Dany Cayouette

9/30/2004 9:57:00 PM

> But for further clarification:
> How should 'a;b\\;;c' be split?
Guess is that it should be
["a", "b\", nil, "c"]

The characters escaped by a backslash are the semicolon, colon and backslash, i.e.

; => \;   : => \:   \ => \\
> If backslashes can be escaped (and you'd want that, because otherwise you
> can't have a field "b\"), it's more difficult.
>
> And maybe the CSV library can help you here.

thanks,
Dany

Dany Cayouette

9/30/2004 10:11:00 PM

On Thu, 30 Sep 2004 17:57:19 -0400
Dany Cayouette <danyc@nortelnetworks.com> wrote:

>
> > But for further clarification:
> > How should 'a;b\\;;c' be split?
> Guess is that it should be
> ["a", "b\", nil, "c"]
Sorry... I meant
["a", "b\\", nil, "c"] where b\\ would ultimately become b\ when the escape chars are processed in the data portion
>
> characters escaped by backslash at semi-colon, colon and backslash i.e.
>
> ; => \; : => \: \ => \\
> > If backslashes can be escaped (and you'd want that, because otherwise you
> > can't have a field "b\"), it's more difficult.
> >
Didn't think about that one... I thought this was simple and the problem was my lack of programming experience...
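
The second step mentioned above (turning b\\ into b\ once the fields have been separated) could look roughly like this. It is only a sketch: the helper name unescape_field is made up, and it assumes that exactly the three characters listed above (semicolon, colon, backslash) can be escaped.

```ruby
# Hypothetical helper: remove one level of backslash escaping from a field.
# Assumes only ';', ':' and '\' are ever escaped, as described above.
def unescape_field(field)
  field.gsub(/\\([;:\\])/) { Regexp.last_match(1) }
end

# Raw fields as they might come out of the splitting step.
raw = ['a', 'b\\\\', 'c\\;d', 'e\\:f']   # 'b\\\\' is b plus two backslashes
clean = raw.map { |f| unescape_field(f) }
p clean   # => ["a", "b\\", "c;d", "e:f"]
```

Doing the split first and the unescape second keeps each step simple, at the cost of walking over the data twice.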

Dany

Florian Gross

9/30/2004 11:09:00 PM

Mark Probert wrote:

> Hi, Rubyists.

Moin!

> What is the best way of attacking field split on ';' when the string looks
> like:
>
> s = 'a;b;c\;;d;'
> s.split(/???;/)
> => ["a", "b", "c\;", "d"]
>
> Or is it best to use s.each_byte and do it by hand?

This works, (even with escaped escape characters) but you might be
better off doing it by hand to keep complexity low:

> irb(main):025:0> str = "hello;world;foo\\;bar;no escape\\\\;blar"; puts str
> hello;world;foo\;bar;no escape\\;blar
> => nil
> irb(main):026:0> str.scan(/(?:(?!\\).(?:\\{2})*\\;|[^;])+/).map { |str| str.gsub(/\\(.)/, '\1') }
> => ["hello", "world", "foo;bar", "no escape\\", "blar"]
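
For comparison, a "by hand" version might look like the sketch below. The method name split_escaped and its parameters are made up for illustration; it assumes a single-character separator and escape character, and it silently drops a dangling escape at the very end of the input.

```ruby
# Hypothetical helper: split str on sep, treating esc as an escape character.
# The character after an escape is copied literally; the escape is dropped.
def split_escaped(str, sep = ';', esc = '\\')
  fields = [String.new]
  escaped = false
  str.each_char do |ch|
    if escaped
      fields.last << ch          # escaped character, taken literally
      escaped = false
    elsif ch == esc
      escaped = true             # remember the escape, emit nothing yet
    elsif ch == sep
      fields << String.new       # unescaped separator starts a new field
    else
      fields.last << ch
    end
  end
  fields
end

by_hand = split_escaped("hello;world;foo\\;bar;no escape\\\\;blar")
p by_hand   # => ["hello", "world", "foo;bar", "no escape\\", "blar"]
```

Unlike the scan above, this version keeps empty fields, so a trailing separator produces a trailing empty string.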

Regards,
Florian Gross

Robert Klemme

10/1/2004 7:45:00 AM

"Mark Probert" <probertm@nospam-acm.org> wrote in message
news:Xns95749654816D0probertmnospamtelusn@198.161.157.145...
> Hi ..
>
> Simon Strandgaard <neoneye@adslhome.dk> wrote:
> >
> > How about something ala
> >
> > irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
> > => ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]
> >
>
> Thanks! That is close enough:
>
> irb(main):019:0> s.scan(/(?:\\[^.]|[^;])*/).each do |it|
> irb(main):020:1* next if it.empty?
> irb(main):021:1> puts " --> #{it}"
> irb(main):022:1> end
> --> a is a word
> --> b is too
> --> c\; for fun
> --> d -- forget it
> => ["a is a word", "", "b is too", "", "c\\; for fun", "", "d -- forget it", "", ""]

>> s = "aa;bbb\\;;abc;;d\\\\;e;"
=> "aa;bbb\\;;abc;;d\\\\;e;"
>> s.scan /(?:\\.|[^\\;])+/
=> ["aa", "bbb\\;", "abc", "d\\\\", "e"]

Regards

robert

Simon Strandgaard

10/1/2004 4:33:00 PM

On Friday 01 October 2004 09:45, Robert Klemme wrote:
[snip]
> >> s = "aa;bbb\\;;abc;;d\\\\;e;"
> => "aa;bbb\\;;abc;;d\\\\;e;"
> >> s.scan /(?:\\.|[^\\;])+/
> => ["aa", "bbb\\;", "abc", "d\\\\", "e"]

If it's a CSV file, shouldn't the output then be:

["aa", "bbb\\;", "abc", "", "d\\\\", "e", ""]

--
Simon Strandgaard

Robert Klemme

10/1/2004 9:42:00 PM

"Simon Strandgaard" <neoneye@adslhome.dk> wrote in message
news:200410012022.59526.neoneye@adslhome.dk...
> On Friday 01 October 2004 09:45, Robert Klemme wrote:
> [snip]
>> >> s = "aa;bbb\\;;abc;;d\\\\;e;"
>> => "aa;bbb\\;;abc;;d\\\\;e;"
>> >> s.scan /(?:\\.|[^\\;])+/
>> => ["aa", "bbb\\;", "abc", "d\\\\", "e"]
>
>
> If it's a CSV file, shouldn't the output then be:
>
> ["aa", "bbb\\;", "abc", "", "d\\\\", "e", ""]

Darn! You're right. Unfortunately using "*" instead of "+" is not
sufficient: far too many empty strings are found that way.
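
To make the problem concrete, and as one possible way around it, here is a small sketch. The helper name fields_keeping_empties is made up; the idea (an assumption, not something tested on real CSV data) is to prepend a separator so that every field, empty or not, is introduced by exactly one ';' and no match is ever zero-width.

```ruby
s = "aa;bbb\\;;abc;;d\\\\;e;"

# With "*", scan also reports a zero-width match at every separator
# and once more at the end of the string.
star = s.scan(/(?:\\.|[^\\;])*/)
p star.size   # more entries than there are fields

# Hypothetical helper: each match consumes one ';' plus the field after it.
def fields_keeping_empties(str)
  (';' + str).scan(/;((?:\\.|[^\\;])*)/).flatten
end

fields = fields_keeping_empties(s)
p fields   # => ["aa", "bbb\\;", "abc", "", "d\\\\", "e", ""]
```

The escapes are still present in the result, so an unescape pass is needed afterwards if the raw field values are wanted.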

robert

comp.lang.ruby
