comp.lang.ruby

Oniguruma Part 2

Christian Kaiser

10/14/2004 10:08:00 AM

In addition to the previous bugs, Oniguruma is not compatible with the Ruby
RegEx parser (which in my eyes is much better).

For example:

\b, \B
Match word boundaries and nonword boundaries respectively

In Oniguruma, '\b' matches the "bell" character.

I guess that when Oniguruma really is incorporated into Ruby 1.9, there will
be a lot of changes to make...

Christian


7 Answers

ts

10/14/2004 10:12:00 AM


>>>>> "C" == Christian Kaiser <bchk@gmx.de> writes:

C> In Oniguruma, '\b' matches the "bell" character.

???

svg% ruby -rjj -e '"xxxaaaxxx aaa".match(/\baaa\b/)'
Regexp /\baaa\b/
0 word-bound
1 exact3 aaa
2 word-bound
3 end
Optimize EXACT_BM : aaa

String <<xxxaaaxxx aaa>> pos=3

0 word-bound xxx|aaaxxx aaa |FAIL

String <<xxxaaaxxx aaa>> pos=10

0 word-bound xaaaxxx |aaa |
1 exact3 xaaaxxx |aaa |
2 word-bound axxx aaa| |
3 end axxx aaa| |
svg%




Guy Decoux


Christian Kaiser

10/14/2004 10:36:00 AM


Yes, I just found that out. I saw the code

static int
conv_backslash_value(int c, ScanEnv* env)
{
  if (IS_SYNTAX_OP(env->syntax, ONIG_SYN_OP_ESC_CONTROL_CHARS)) {
    switch (c) {
    case 'n': return '\n';
    case 't': return '\t';
    case 'r': return '\r';
    case 'f': return '\f';
    case 'a': return '\007';
    case 'b': return '\010';
    case 'e': return '\033';
    case 'v':
      if (IS_SYNTAX_OP2(env->syntax, ONIG_SYN_OP2_ESC_V_VTAB))
        return '\v';
      break;

    default:
      break;
    }
  }
  return c;
}

which made it very tempting to assume that this was the problem...

Thanks for your help!

BTW: Do you have any idea how to fix the bug from my first mail
("upper=-1...")?

Christian



ts

10/14/2004 10:44:00 AM


>>>>> "C" == Christian Kaiser <bchk@gmx.de> writes:

C> which was very tempting to assume that problem...

That code handles this case:

svg% ruby -rjj -e '/[\b]/.dump'
Regexp /[\b]/
0 cclass \010 (1)
1 end
Optimize MAP_SEARCH \010 (1)
svg%
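
The distinction in the two dumps can also be checked without the debug build. This is only a small sketch in plain Ruby (the variable names are made up for illustration): outside a character class \b is a word-boundary assertion, while inside a character class it is the backspace character.

```ruby
# \b outside a character class is a word-boundary assertion;
# inside a character class it stands for the backspace character (0x08).
text = "xxxaaaxxx aaa"

# Only the standalone "aaa" is surrounded by word boundaries.
boundary_pos = text =~ /\baaa\b/

# [\b] matches a literal backspace, not a boundary.
# In the double-quoted string, "\b" is also a backspace.
backspace_pos = "ab\bcd" =~ /[\b]/

# A string without a backspace does not match [\b] at all.
no_match = "abcd" =~ /[\b]/

puts boundary_pos.inspect   # 10
puts backspace_pos.inspect  # 2
puts no_match.inspect       # nil
```

The first result agrees with the pos=10 match in the trace earlier in the thread.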


C> BTW: Do you have any idea how to fix the bug from my first mail
C> ("upper=-1...")?

I have, but the author of Oniguruma will give you a better patch.

I was not able to reproduce your first problem (RegEx strings > 16 KB)
with 1.9


Guy Decoux


Markus

10/14/2004 4:08:00 PM




IIRC, \010 is backspace, not bell. That would make 'a' (attention?) the
bell.




Christian Kaiser

10/15/2004 7:02:00 AM


> IIRC, \010 is backspace, not bell. That would make 'a' (attention?) the
> bell.

Nope.

\010 is octal, so it's \x08, which is bell.

Fortunately it is not in RegEx. Whatever that code is for, it's not used for
the escapes.

I had a problem with '\w' only catching codes < 128 (simple 7-bit ASCII),
but now I convert scripts and input data to UTF-8 (and select UTF-8 on
Ruby's command line), and this works perfectly. If only it were
readable. I'd rather have UTF-16 for everything (script and data), but
well...

Christian
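
For readers on a current Ruby, here is a hedged sketch of the '\w' issue. It assumes Ruby 1.9 or later, not the 1.8-era setup described above, and the sample word and variable names are made up. As far as I know, '\w' there matches only ASCII word characters even in a UTF-8 string, while \p{Word} and [[:word:]] are Unicode-aware.

```ruby
# The \u escapes produce a UTF-8 string ("Gruesse" spelled with u-umlaut
# and sharp s) without depending on the source file's encoding.
word = "Gr\u00FC\u00DFe"

# \w is ASCII-only in Ruby 1.9+, so the two non-ASCII letters are skipped.
ascii_word_chars = word.scan(/\w/)

# \p{Word} and the POSIX bracket [[:word:]] follow Unicode properties.
unicode_word_chars = word.scan(/\p{Word}/)
posix_word_chars   = word.scan(/[[:word:]]/)

puts ascii_word_chars.length    # 3
puts unicode_word_chars.length  # 5
puts posix_word_chars.length    # 5
```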



Markus

10/15/2004 2:25:00 PM


I know how octal works, thanks.

I also know that 8 is backspace, not bell. Bell is 7. Has been
for decades.

-- Markus

Name  Oct  Dec  Hex  Control-key  Control Action
NUL     0    0    0  ^@           Null character
SOH     1    1    1  ^A           Start of heading, = console interrupt
STX     2    2    2  ^B           Start of text, maintenance mode on HP console
ETX     3    3    3  ^C           End of text
EOT     4    4    4  ^D           End of transmission, not the same as ETB
ENQ     5    5    5  ^E           Enquiry, goes with ACK; old HP flow control
ACK     6    6    6  ^F           Acknowledge, clears ENQ logon hang
BEL     7    7    7  ^G           Bell, rings the bell
BS     10    8    8  ^H           Backspace, works on HP terminals/computers
HT     11    9    9  ^I           Horizontal tab, move to next tab stop
LF     12   10    a  ^J           Line Feed
VT     13   11    b  ^K           Vertical tab
FF     14   12    c  ^L           Form Feed, page eject
CR     15   13    d  ^M           Carriage Return
SO     16   14    e  ^N           Shift Out, alternate character set
SI     17   15    f  ^O           Shift In, resume default character set
DLE    20   16   10  ^P           Data link escape
DC1    21   17   11  ^Q           XON, with XOFF to pause listings; "okay to send"
DC2    22   18   12  ^R           Device control 2, block-mode flow control
DC3    23   19   13  ^S           XOFF, with XON is TERM=18 flow control
DC4    24   20   14  ^T           Device control 4
NAK    25   21   15  ^U           Negative acknowledge
SYN    26   22   16  ^V           Synchronous idle
ETB    27   23   17  ^W           End transmission block, not the same as EOT
CAN    30   24   18  ^X           Cancel line, MPE echoes !!!
EM     31   25   19  ^Y           End of medium, Control-Y interrupt
SUB    32   26   1a  ^Z           Substitute
ESC    33   27   1b  ^[           Escape, next character is not echoed
FS     34   28   1c  ^\           File separator
GS     35   29   1d  ^]           Group separator
RS     36   30   1e  ^^           Record separator, block-mode terminator
US     37   31   1f  ^_           Unit separator
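
The BEL and BS rows can be checked directly in Ruby. This is only a small sketch, with illustrative variable names: integer literals with a leading 0 are octal, and in a double-quoted string "\a" is the bell and "\b" is the backspace.

```ruby
# Integer literals with a leading 0 are octal.
bell_code      = 007   # decimal 7
backspace_code = 010   # decimal 8

# Character escapes in double-quoted strings.
bell_ord      = "\a".ord   # BEL
backspace_ord = "\b".ord   # BS

puts bell_code            # 7
puts backspace_code       # 8
puts bell_ord             # 7
puts backspace_ord        # 8
puts "\010" == "\b"       # true: octal escape 010 is the backspace
puts "\007" == "\a"       # true: octal escape 007 is the bell
```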






Christian Kaiser

10/18/2004 9:24:00 AM


Oops, sorry. I didn't mean to offend. I should also have known that 8 is BS
(I have known it for decades, actually).

My mistake.

Christian
