Asp Forum - How to split(//) with respect to bigraphs?

Pavel Smerk

8/2/2006 5:42:00 PM

And once more question:

In Czech, c followed by h is treated (for sorting etc.) as a single
character/grapheme, ch. I need to split a string into single characters
while respecting this absurd convention.

In Perl I can write

split /(?<=(?![Cc][Hh]).)/, $string

and it works fine.

Unfortunately, Ruby does not implement/support this "zero-width positive
look-behind assertion", so the question is how can one efficiently split
the string in Ruby?

Thanks,

P.

9 Answers

Justin Collins

8/2/2006 6:04:00 PM

Pavel Smerk wrote:
> And once more question:
>
> In Czech, c followed by h is considered (for sorting etc.) as one
> character/grapheme ch. I need to split string to single characters
> with respect to this absurd manner.
>
> In Perl I can write
>
> split /(?<=(?![Cc][Hh]).)/, $string
>
> and it works fine.
>
> Unfortunately, Ruby does not implement/support this "zero-width
> positive look-behind assertion", so the question is how can one
> efficiently split the string in Ruby?
>
> Thanks,
>
> P.
>
Does this work?

irb(main):001:0> "czech".split(/([Cc][Hh])|/)
=> ["c", "z", "e", "ch"]
irb(main):002:0> "check czech".split(/([Cc][Hh])|/)
=> ["", "ch", "e", "c", "k", " ", "c", "z", "e", "ch"]
irb(main):003:0> "cHeck czeCh".split(/([Cc][Hh])|/)
=> ["", "cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

-Justin

Paul Battley

8/2/2006 6:28:00 PM

On 02/08/06, Justin Collins <collinsj@seattleu.edu> wrote:
> irb(main):001:0> "czech".split(/([Cc][Hh])|/)
> => ["c", "z", "e", "ch"]
> irb(main):002:0> "check czech".split(/([Cc][Hh])|/)
> => ["", "ch", "e", "c", "k", " ", "c", "z", "e", "ch"]
> irb(main):003:0> "cHeck czeCh".split(/([Cc][Hh])|/)
> => ["", "cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

Or use scan:

str.scan(/(?:ch)|./i)

You might still have a problem with other characters, though,
depending on the encoding and normalisation.

Paul.
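
A minimal sketch of the encoding and normalisation caveat, assuming a Ruby with encoding-aware strings (1.9 or later) and String#unicode_normalize (2.2 or later), neither of which was available when this thread was written. The helper name czech_graphemes and the sample word are hypothetical, chosen only for illustration.

```ruby
# Hypothetical helper, not from the thread: split a string into Czech
# "letters", treating ch/Ch/cH/CH as one unit.
def czech_graphemes(str)
  # NFC composes a base letter plus combining mark into one code point
  # where possible, so "." sees an accented letter as one character.
  str.unicode_normalize(:nfc).scan(/ch|./i)
end

composed   = "chl\u00E9b"    # precomposed e-acute
decomposed = "chle\u0301b"   # e followed by a combining acute accent

# Without normalisation the accent becomes an element of its own.
p decomposed.scan(/ch|./i).length   # 5

# After normalisation both spellings give the same four elements.
p czech_graphemes(decomposed)
p czech_graphemes(composed)
```

Normalising first is one possible design choice; whether NFC is appropriate depends on where the text comes from.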

Pavel Smerk

8/2/2006 6:49:00 PM

Justin Collins wrote:
> Pavel Smerk wrote:
>
>> And once more question:

one more :)

>> In Czech, c followed by h is considered (for sorting etc.) as one
>> character/grapheme ch. I need to split string to single characters
>> with respect to this absurd manner.
>>
>> In Perl I can write
>>
>> split /(?<=(?![Cc][Hh]).)/, $string
>>
>> and it works fine.
>>
>> Unfortunately, Ruby does not implement/support this "zero-width
>> positive look-behind assertion", so the question is how can one
>> efficiently split the string in Ruby?

Stupid question. :-) One should not insist on word-for-word translation
when rewriting some code from Perl to Ruby. :-)

The solution can be e.g. scan(/[cC][hH]|./)

irb(main):001:0> "cHeck czeCh".scan(/[cC][hH]|./)
=> ["cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

> Does this work?
>
> irb(main):001:0> "czech".split(/([Cc][Hh])|/)
> => ["c", "z", "e", "ch"]
> irb(main):002:0> "check czech".split(/([Cc][Hh])|/)
> => ["", "ch", "e", "c", "k", " ", "c", "z", "e", "ch"]
> irb(main):003:0> "cHeck czeCh".split(/([Cc][Hh])|/)
> => ["", "cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]

The scan version is slightly better, as it never returns the empty
string. Thanks anyway, of course.

But where is this feature of split documented?
http://www.rubycentral.com/ref/ref_c_string.... does not mention that
split returns not only the delimited substrings but also the groups
captured by the regexp match.

Regards,

P.

Christian Neukirchen

8/2/2006 6:49:00 PM

Pavel Smerk <smerk@fi.muni.cz> writes:

> And once more question:
>
> In Czech, c followed by h is considered (for sorting etc.) as one
> character/grapheme ch. I need to split string to single characters
> with respect to this absurd manner.
>
> In Perl I can write
>
> split /(?<=(?![Cc][Hh]).)/, $string

string.scan(/ch|./i)

--
Christian Neukirchen <chneukirchen@gmail.com> http://chneuk...

Pavel Smerk

8/2/2006 6:54:00 PM

Paul Battley wrote:
> On 02/08/06, Justin Collins <collinsj@seattleu.edu> wrote:
>
>> irb(main):001:0> "czech".split(/([Cc][Hh])|/)
>> => ["c", "z", "e", "ch"]
>> irb(main):002:0> "check czech".split(/([Cc][Hh])|/)
>> => ["", "ch", "e", "c", "k", " ", "c", "z", "e", "ch"]
>> irb(main):003:0> "cHeck czeCh".split(/([Cc][Hh])|/)
>> => ["", "cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]
>
>
> Or use scan:
>
> str.scan(/(?:ch)|./i)

Yes, scan occurred to me in the meantime too. Why the (?:)?
str.scan(/ch|./i) does exactly the same, doesn't it?

Thank you,

P.

Paul Battley

8/2/2006 7:10:00 PM

On 02/08/06, Pavel Smerk <smerk@fi.muni.cz> wrote:
> Yes, the use of scan strikes me in the meantime too. Why (?:)?
> str.scan(/ch|./i) does exactly the same, doesn't it?

Yeah, there's no need for the (?: ... ). I started off thinking it was
more complicated than it was, and forgot to take that out. I really
need a regexp refactoring tool.

Paul.
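
A short check of the point settled here, plus details the pattern relies on: alternatives are tried left to right, so ch has to come before the dot; a capturing group (unlike a non-capturing one) changes what scan returns; and the dot skips newlines unless the m flag is set. The sample strings are made up for illustration.

```ruby
str = "cHeck czeCh"

with_group    = str.scan(/(?:ch)|./i)
without_group = str.scan(/ch|./i)
p with_group == without_group   # true: (?: ... ) changes nothing here

# A capturing group is different: scan then returns the captures.
p "cha".scan(/(ch)|./i)         # [["ch"], [nil]]

# Order of alternatives matters: with "." first, "ch" is never tried.
p "czech".scan(/.|ch/i)         # ["c", "z", "e", "c", "h"]

# "." does not match a newline by default, so newlines silently vanish;
# the m flag keeps them.
p "ch\nch".scan(/ch|./i)        # ["ch", "ch"]
p "ch\nch".scan(/ch|./im)       # ["ch", "\n", "ch"]
```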

Justin Collins

8/2/2006 7:21:00 PM

Pavel Smerk wrote:
> Justin Collins wrote:
>> Pavel Smerk wrote:
>>
>>> And once more question:
>
> one more :)
>
>>> In Czech, c followed by h is considered (for sorting etc.) as one
>>> character/grapheme ch. I need to split string to single characters
>>> with respect to this absurd manner.
>>>
>>> In Perl I can write
>>>
>>> split /(?<=(?![Cc][Hh]).)/, $string
>>>
>>> and it works fine.
>>>
>>> Unfortunately, Ruby does not implement/support this "zero-width
>>> positive look-behind assertion", so the question is how can one
>>> efficiently split the string in Ruby?
>
> Stupid question. :-) One should not insist on word-for-word
> translation when rewriting some code from Perl to Ruby. :-)
>
> The solution can be e.g. scan(/[cC][hH]|./)
>
> irb(main):001:0> "cHeck czeCh".scan(/[cC][hH]|./)
> => ["cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]
>
>> Does this work?
>>
>> irb(main):001:0> "czech".split(/([Cc][Hh])|/)
>> => ["c", "z", "e", "ch"]
>> irb(main):002:0> "check czech".split(/([Cc][Hh])|/)
>> => ["", "ch", "e", "c", "k", " ", "c", "z", "e", "ch"]
>> irb(main):003:0> "cHeck czeCh".split(/([Cc][Hh])|/)
>> => ["", "cH", "e", "c", "k", " ", "c", "z", "e", "Ch"]
>
> Scan version is slightly better as it never returns the empty string.
> Of course, thanks anyway.
>
> But where can one find this feature of the split in the documentation?
> http://www.rubycentral.com/ref/ref_c_string.... does not
> mention split returns not only delimited substrings, but also
> successful groups from the match of the regexp.
>
> Regards,
>
> P.
>

As far as I can see, it's not in the documentation. I found it by
accident. But, yes, the scan method is better. :)

-Justin

Dave Howell

8/2/2006 9:43:00 PM

On Aug 2, 2006, at 12:21, Justin Collins wrote:

> Pavel Smerk wrote:
>>
>> But where can one find this feature of the split in the
>> documentation? http://www.rubycentral.com/ref/ref_c_string....
>> does not mention split returns not only delimited substrings, but
>> also successful groups from the match of the regexp.
>>
>> Regards,
>>
>> P.
>>
>
> As far as I can see, it's not in the documentation. I found it by
> accident. But, yes, the scan method is better. :)

Oh, my gosh. If only you'd posted this little tidbit two days ago, I'd
have saved a couple hours of code-wrangling.

For sorting purposes, I needed to turn something like
one-and.two@three.net
into
net.three@two.one-and

I started with str.split(/[.]|@/), but then I'd lose where the @ went.
I tried turning it into
["one-and", ".", "two", "@", "three", ".", "net"]
so I could .reverse that, but without positive look-behind, I couldn't
find any way to detect the break *after* the dot except with \w, which
would also trigger after the hyphen.

After hours of work, I ended up with something that was not only long
and confusing, involving .collect and an inner search loop and other
stuff, but when I brought it back up to check it for this email
message, I discovered that it didn't even actually work correctly.

And all along, all I needed to do was change
str.split(/[.]|@/).reverse.join
into
str.split(/([.]|@)/).reverse.join

Dang. And thanks! :)
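
The one-line fix can be sketched as follows. The helper name reverse_address and the extra sample addresses are hypothetical; the split/reverse/join chain itself is the one from the post.

```ruby
# Hypothetical helper name; the chain inside is the one from the post.
def reverse_address(addr)
  # The capture group makes split keep each "." and "@" as an element.
  addr.split(/([.]|@)/).reverse.join
end

p "one-and.two@three.net".split(/([.]|@)/)
# ["one-and", ".", "two", "@", "three", ".", "net"]

p reverse_address("one-and.two@three.net")   # "net.three@two.one-and"

# Used as a sort key, the reversed form orders addresses by top-level
# domain first, then domain, then the local part.
addresses = ["bob@example.org", "one-and.two@three.net", "amy@three.net"]
p addresses.sort_by { |a| reverse_address(a) }
# ["amy@three.net", "one-and.two@three.net", "bob@example.org"]
```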

Morton Goldberg

8/2/2006 11:28:00 PM

On Aug 2, 2006, at 3:05 PM, Pavel Smerk wrote:
> But where can one find this feature of the split in the
> documentation? http://www.rubycentra...
> ref_c_string.html#split does not mention split returns not only
> delimited substrings, but also successful groups from the match of
> the regexp.

In Dave Thomas' Pickaxe book. Under String#split he writes:

"If pattern is a Regexp, str is divided where the pattern matches.
Whenever the pattern matches a zero-length string, str is split into
individual characters. If pattern includes groups, these groups will
be included in the returned values."

Then he gives the following example:

"a@1bb@2ccc".split(/@(\d)/) => ["a", "1", "bb", "2", "ccc"]

Regards, Morton
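
The quoted behaviour can be checked side by side. This is only a small sketch built on the Pickaxe example string; the variants without a capturing group are added for comparison.

```ruby
s = "a@1bb@2ccc"

# No group: the matched delimiters are dropped.
p s.split(/@\d/)       # ["a", "bb", "ccc"]

# Capturing group: the captured text is included in the result.
p s.split(/@(\d)/)     # ["a", "1", "bb", "2", "ccc"]

# Non-capturing group: behaves like no group at all.
p s.split(/@(?:\d)/)   # ["a", "bb", "ccc"]

# A pattern that matches a zero-length string splits into characters.
p "abc".split(//)      # ["a", "b", "c"]
```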

comp.lang.ruby
