comp.lang.ruby

Problem with String encoding when modifying it in C method

Iñaki Baz Castillo

4/3/2009 4:40:00 PM

Hi, I've added a method "multi_capitalize" to the String class. The
method is written in C and basically modifies the string:

"record-roUTE".multi_capitalize => "Record-Route"

The problem is that after the method execution, the new String has
ASCII-8BIT encoding, while the original string had UTF-8 (using Ruby
1.9.1).

--------------------------------------------------------------------------------
irb> hname = "record-rouTE-€"
"record-rouTE-€"

irb> hname.encoding
#<Encoding:UTF-8>

irb> hname2 = hname.multi_capitalize
"Record-Route-\xE2\x82\xAC" <------- !!!

irb> hname2.encoding
#<Encoding:ASCII-8BIT> <------- !!!

irb> hname2.force_encoding("utf-8")
"Record-Route-€"

irb> hname2.encoding
#<Encoding:UTF-8>
--------------------------------------------------------------------------------

What should I add to my C method to maintain the UTF-8 encoding
after the changes to the string?
Could I invoke "force_encoding" from the C code before returning
the modified string? How would I invoke it?

Thanks a lot.


-- 
Iñaki Baz Castillo
<ibc@aliax.net>

Andre Nathan

4/3/2009 6:18:00 PM

On Sat, 2009-04-04 at 01:39 +0900, Iñaki Baz Castillo wrote:
> Could I invoke the C "force_encoding()" function from the C code
> before returning the modified string? How to invoke it?

You can call it as (untested):

rb_funcall(str, rb_intern("force_encoding"), 1, rb_str_new2("utf-8"));

I'm not sure how to make your multi-capitalize method do the right
thing, but maybe reading the source of rb_str_capitalize_bang in
string.c helps.

Best,
Andre


Iñaki Baz Castillo

4/3/2009 6:34:00 PM

On Friday, 3 April 2009, Andre Nathan wrote:
> On Sat, 2009-04-04 at 01:39 +0900, Iñaki Baz Castillo wrote:
> > Could I invoke the C "force_encoding()" function from the C code
> > before returning the modified string? How to invoke it?
>
> You can call it as (untested):
>
> rb_funcall(str, rb_intern("force_encoding"), 1, rb_str_new2("utf-8"));
>
> I'm not sure how to make your multi-capitalize method do the right
> thing, but maybe reading the source of rb_str_capitalize_bang in
> string.c helps.

Thanks a lot, I will check it.

-- 
Iñaki Baz Castillo <ibc@aliax.net>

Iñaki Baz Castillo

4/3/2009 7:01:00 PM

On Friday, 3 April 2009, Iñaki Baz Castillo wrote:
> On Friday, 3 April 2009, Andre Nathan wrote:
> > On Sat, 2009-04-04 at 01:39 +0900, Iñaki Baz Castillo wrote:
> > > Could I invoke the C "force_encoding()" function from the C code
> > > before returning the modified string? How to invoke it?
> >
> > You can call it as (untested):
> >
> > rb_funcall(str, rb_intern("force_encoding"), 1, rb_str_new2("utf-8"));
> >
> > I'm not sure how to make your multi-capitalize method do the right
> > thing, but maybe reading the source of rb_str_capitalize_bang in
> > string.c helps.
>
> Thanks a lot, I will check it.

Yes, rb_str_capitalize_bang handles a lot of encoding-related stuff:

c = rb_enc_codepoint(s, send, enc);
if (rb_enc_islower(c, enc)) {
    rb_enc_mbcput(rb_enc_toupper(c, enc), s, enc);
    modify = 1;
}
s += rb_enc_codelen(c, enc);

so this is the way :)
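
A pure-Ruby reference can help when cross-checking the C version. The sketch below assumes, going only by the single example in this thread, that multi_capitalize capitalizes every "-"-separated part of the name. The method name multi_capitalize_reference is hypothetical, and this is not the C implementation discussed here. Note that String#capitalize only touches ASCII letters on older Rubies such as 1.9.

```ruby
# Hypothetical pure-Ruby reference; not the poster's C implementation.
# Assumed semantics: capitalize every "-"-separated part of the name.
def multi_capitalize_reference(name)
  # The regexp match is encoding-aware, so multibyte characters are never
  # split, and gsub returns a string tagged with the receiver's encoding.
  name.gsub(/[^-]+/) { |part| part.capitalize }
end

header = "record-rouTE-\u20AC"   # the "\u" escape makes this literal UTF-8
result = multi_capitalize_reference(header)
puts result                      # Record-Route-€
puts result.encoding             # UTF-8
```

Comparing the C method's return value and its encoding against this reference is one cheap way to catch a lost encoding tag.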

Thanks a lot.

-- 
Iñaki Baz Castillo <ibc@aliax.net>

KUBO Takehiro

4/4/2009 10:33:00 AM

Hi,

On Sat, Apr 4, 2009 at 1:39 AM, Iñaki Baz Castillo <ibc@aliax.net> wrote:
> Hi, I've added a method "multi_capitalize" to String class. This
> method is done in C and basically modifies the string:
>
> "record-roUTE".multi_capitalize => "Record-Route"
>
> The problem is that after the method execution, the new String has
> ASCII-8BIT encoding, while the original string had UTF-8 (using Ruby
> 1.9.1).

rb_encoding *enc = rb_enc_get(original_string);

/* create a new string with the same encoding as the original string */
return rb_enc_str_new(char_pointer, length, enc);

rb_str_new() makes an ASCII-8BIT string.
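
The same effect can be seen from Ruby itself, which may make the C-level advice easier to picture. This is only a Ruby-level analogy under stated assumptions: the helper names rebuild_untagged and rebuild_tagged are hypothetical, and pack("C*") stands in for a buffer copied with rb_str_new().

```ruby
# Hypothetical helper names; a Ruby-level analogy, not the C API itself.
def rebuild_untagged(original)
  # pack("C*") builds a string from raw bytes; the result is tagged
  # ASCII-8BIT, much like a string created with rb_str_new().
  original.bytes.pack("C*")
end

def rebuild_tagged(original)
  # Copy the encoding from the source string instead of hard-coding "utf-8";
  # this plays the role of rb_enc_get() followed by rb_enc_str_new().
  rebuild_untagged(original).force_encoding(original.encoding)
end

source = "Record-Route-\u20AC"
puts rebuild_untagged(source).encoding == Encoding::ASCII_8BIT  # true
puts rebuild_tagged(source).encoding == Encoding::UTF_8         # true
puts rebuild_tagged(source) == source                           # true
```

Taking the encoding from the original string, rather than forcing "utf-8", keeps the method correct for input in other encodings as well.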

Iñaki Baz Castillo

4/4/2009 10:39:00 AM

On Saturday, 4 April 2009, KUBO Takehiro wrote:
> Hi,
>
> On Sat, Apr 4, 2009 at 1:39 AM, Iñaki Baz Castillo <ibc@aliax.net> wrote:
> > Hi, I've added a method "multi_capitalize" to String class. This
> > method is done in C and basically modifies the string:
> >
> > "record-roUTE".multi_capitalize => "Record-Route"
> >
> > The problem is that after the method execution, the new String has
> > ASCII-8BIT encoding, while the original string had UTF-8 (using Ruby
> > 1.9.1).
>
> rb_encoding *enc = rb_enc_get(original_string);
>
> /* create a new string with the same encoding as the original string */
> return rb_enc_str_new(char_pointer, length, enc);
>
> rb_str_new() makes an ASCII-8BIT string.

Thanks.

-- 
Iñaki Baz Castillo <ibc@aliax.net>