Asp Forum - to_yaml and international characters

h3raLd

10/23/2007 12:45:00 PM

Hello,

I noticed some weird behavior when converting a string containing
international characters to YAML:

irb(main):002:0> 'test òùè'.to_yaml
=> "--- \"test \\x95\\x97\\x8A\"\n"
irb(main):003:0>

...but:

irb(main):001:0> 'test òùè'
=> "test \225\227\212"

Basically, the to_yaml method seems to use some strange hex escape
sequences which do not correspond to ANSI, UTF-8 or windows-1252...
The funny part is that when I load the same string from YAML, it is
displayed correctly in the console. This would be fine, except that
when I tried to save it to a file the international characters are not
displayed properly (or better, they are converted to the corresponding
ANSI/UTF-8 characters). What's going on here? What encoding does
to_yaml use to escape international characters?
According to the docs it should be UTF-8, but apparently it is not.

Ruby version: 1.8.6
OS: Windows XP

Any ideas?

Luis Parravicini

10/23/2007 1:06:00 PM

On 10/23/07, h3raLd <h3rald@gmail.com> wrote:
> I noticed some weird behavior when converting a string containing
> international characters to YAML:
>
> irb(main):002:0> 'test òùè'.to_yaml
> => "--- \"test \\x95\\x97\\x8A\"\n"
> irb(main):003:0>
>
> ...but:
>
> irb(main):001:0> 'test òùè'
> => "test \225\227\212"

\225\227\212 is the same as \x95\x97\x8A, the former in octal, and the
latter in hex.

irb(main):002:0> 0x95.to_s(8)
=> "225"
irb(main):003:0> 0x97.to_s(8)
=> "227"
irb(main):004:0> 0x8a.to_s(8)
=> "212"
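
The equivalence can also be checked on the string literals themselves. This is only a minimal sketch: it assumes Ruby 2.0 or later for `String#b` (the thread itself uses 1.8.6), and the variable names are illustrative.

```ruby
# The same three bytes, written once with octal and once with hex escapes.
octal = "\225\227\212".b
hex   = "\x95\x97\x8A".b

# Both literals hold identical bytes, so this prints true.
puts octal == hex

# Print each byte in both notations side by side.
octal.bytes.each do |byte|
  puts format("\\%03o == \\x%02X", byte, byte)
end
```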

Bye

--
Luis Parravicini
http://ktulu.co...

h3raLd

10/23/2007 1:44:00 PM

On Oct 23, 3:05 pm, "Luis Parravicini" <lparr...@gmail.com> wrote:
> On 10/23/07, h3raLd <h3r...@gmail.com> wrote:
>
> > I noticed some weird behavior when converting a string containing
> > international characters to YAML:
>
> > irb(main):002:0> 'test òùè'.to_yaml
> > => "--- \"test \\x95\\x97\\x8A\"\n"
> > irb(main):003:0>
>
> > ...but:
>
> > irb(main):001:0> 'test òùè'
> > => "test \225\227\212"
>
> \225\227\212 is the same as \x95\x97\x8A, the former in octal, and the
> latter in hex.
>
> irb(main):002:0> 0x95.to_s(8)
> => "225"
> irb(main):003:0> 0x97.to_s(8)
> => "227"
> irb(main):004:0> 0x8a.to_s(8)
> => "212"
>
> Bye
>
> --
> Luis Parravicini
> http://ktulu.co...

Thanks a lot, this solves part of the mystery!

I figured out the other half too: the reason I can't view the
characters as ANSI or UTF-8 is that I'm typing them into a DOS console,
which unfortunately means "Code Page 437" (http://en.wiki...
wiki/Code_page_437).
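
If the bytes really are Code Page 437, they can be relabelled and transcoded to UTF-8. The sketch below assumes Ruby 1.9 or later, where strings carry an encoding and CP437 is available under the name `IBM437`; on 1.8.6 the `Iconv` standard library would probably be the closest equivalent. The variable names are made up for illustration.

```ruby
# Bytes as typed into a CP437 console (taken from the irb output above).
raw = "test \x95\x97\x8A".b

# Tell Ruby the bytes are CP437 (it calls the encoding IBM437),
# then transcode them to UTF-8.
utf8 = raw.force_encoding("IBM437").encode("UTF-8")

puts utf8            # shows the accented characters on a UTF-8 terminal
puts utf8.encoding
```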

Richard Conroy

10/23/2007 1:54:00 PM

On 10/23/07, h3raLd <h3rald@gmail.com> wrote:
> Hello,
>
> I noticed some weird behavior when converting a string containing
> international characters to YAML:
>
> irb(main):002:0> 'test òùè'.to_yaml
> => "--- \"test \\x95\\x97\\x8A\"\n"
> irb(main):003:0>

IIRC the various YAML implementations in each language can choose
to output UTF-8, or unicode-escaped ASCII. I think a YAML implementation
has to be able to read either.
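
A small sketch of the "read either" part, assuming a current Ruby whose `yaml` library is Psych (not the 1.8.6 implementation discussed in this thread): the escaped and the literal form of a scalar should load to the same string.

```ruby
require "yaml"

# The same text, once with YAML \u escapes inside a double-quoted scalar...
escaped = YAML.load(%q(--- "test \u00F2\u00F9\u00E8"))

# ...and once as literal UTF-8 characters in a plain scalar.
literal = YAML.load("--- test \u00F2\u00F9\u00E8")

puts escaped == literal

# Whatever form to_yaml chooses to emit, loading it back gives the string.
round_trip = YAML.load(literal.to_yaml)
puts round_trip == literal
```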

Jamal Bengeloun

10/30/2007 12:28:00 AM

Sorry but I do not get it. Plus I am not sure it is only related to
YAML.

I am working on something similar and the only answers I can relate are
those in Python (such as:
http://www.reportlab.com/i18n/python_unicode_tut...). I mean I
got so far as understanding that:

é gets translated to \202
è gets translated to \212
à gets translated to \205
ç gets translated to \207
â gets translated to \203
ê gets translated to \210
î gets translated to \214
ô gets translated to \223
û gets translated to \226
ä gets translated to \204
ë gets translated to \211
ï gets translated to \213
ö gets translated to \224
ù gets translated to \227

But why?

The app I am working on gets its data from different sources (yaml
files, dBaseIV files, MS Access files) and then produces xml files (via
builder).

When using print you get the original character. When using p, you get
the escaped equivalent.
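
The difference comes from `p` printing the result of `inspect`, while `print` writes the bytes unchanged. A minimal sketch, assuming Ruby 2.0 or later (where `inspect` shows such bytes as hex escapes, whereas 1.8 shows them in octal, as in the list above):

```ruby
# "café" with the é stored as the single byte 0x82 (octal 202).
s = "caf\x82".b

# p s prints s.inspect: the escaped, source-code-like form.
puts s.inspect

# print writes the four bytes as they are; what you see then depends
# on the code page of the terminal, not on Ruby.
print s, "\n"

puts s.bytesize
```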

And that's only the start of your problems! When trying to get those
characters into utf-8

é gets translated to \202 that then gets translated to ‚
è gets translated to \212 that then gets translated to Š
à gets translated to \205 that then gets translated to …
ç gets translated to \207 that then gets translated to ‡
â gets translated to \203 that then gets translated to ƒ
ê gets translated to \210 that then gets translated to ˆ
î gets translated to \214 that then gets translated to Œ
ô gets translated to \223 that then gets translated to “
û gets translated to \226 that then gets translated to –
ä gets translated to \204 that then gets translated to „
ë gets translated to \211 that then gets translated to ‰
ï gets translated to \213 that then gets translated to ‹
ö gets translated to \224 that then gets translated to ”
ù gets translated to \227 that then gets translated to —
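
One possible reading of the table above is that the same byte is being decoded under two different code pages: CP437 on the way in and Windows-1252 on the way out. The sketch below shows that effect for a few of the bytes. It assumes Ruby 1.9 or later, and `decode` is a hypothetical helper written for this example.

```ruby
# Decode a single byte under the given encoding and return it as UTF-8.
def decode(byte, encoding)
  [byte].pack("C").force_encoding(encoding).encode("UTF-8")
end

# A few of the bytes from the list above.
[0x82, 0x8A, 0x85, 0x87].each do |byte|
  cp437  = decode(byte, "IBM437")
  cp1252 = decode(byte, "Windows-1252")
  puts format("\\%03o  CP437: %s  Windows-1252: %s", byte, cp437, cp1252)
end
```

If this is what is happening, the fix is to decode the input as CP437 once, at the point where it is read, and to work with UTF-8 from there on.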

Does someone have an explanation?

Does anyone know how to get those characters into the final xml files?

Any help would be greatly appreciated.

Jamal

Luis Parravicini wrote:
> On 10/23/07, h3raLd <h3rald@gmail.com> wrote:
>> => "test \225\227\212"
> \225\227\212 is the same as \x95\x97\x8A, the former in octal, and the
> latter in hex.
>
> irb(main):002:0> 0x95.to_s(8)
> => "225"
> irb(main):003:0> 0x97.to_s(8)
> => "227"
> irb(main):004:0> 0x8a.to_s(8)
> => "212"
>
>
> Bye

--
Posted via http://www.ruby-....

Konrad Meyer

10/30/2007 1:55:00 AM

Quoth Jamal Bengeloun:
> ...
>
> The app I am working on gets its data from different sources (yaml
> files, dBaseIV files, MS Access files) and then produces xml files (via
> builder).
>
> When using print you get the original character. When using p, you get
> the escaped equivalent.
>
> And that's only the start of your problems! When trying to get those
> characters into utf-8
>
> ...
>
> Does someone have an explanation?
>
> Does anyone know how to get those characters into the final xml files?
>
> Any help would be greatly appreciated.
>
> Jamal

In short, you're asking what the difference is between "\303\251", "é",
and "‚".

The first is an octal sequence embedded in a string (it happens to be the
same as utf-8 'é'). The second is also utf-8 'é'. These two are the same
string ("\303\251" == "é"). The last, '‚' is the html-escaped notation
for a 'é' (I'm trusting your email for the correct number here). That is,
literally "‚" != "é", but they should render the same to a browser
capable of displaying utf-8.
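
The byte-level part of this can be checked directly. A sketch assuming Ruby 1.9 or later and a UTF-8 source file; `octal_escapes` is a hypothetical helper for this example.

```ruby
# Render every byte of a string as an octal escape, the way Ruby 1.8 shows it.
def octal_escapes(str)
  str.bytes.map { |b| format("\\%03o", b) }.join
end

e_acute = "\u00E9"   # é

# In UTF-8 the character takes two bytes...
puts octal_escapes(e_acute)                   # \303\251

# ...while in CP437 it is a single byte.
puts octal_escapes(e_acute.encode("IBM437"))  # \202

# The octal literal and the character are the same UTF-8 string.
puts "\303\251" == e_acute                    # true
```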

HTH,
--
Konrad Meyer <konrad@tylerc.org> http://konrad.sobertil...

mortee

10/30/2007 2:16:00 AM

Jamal Bengeloun

10/30/2007 11:05:00 AM

Probably. I am a beginner in ruby.

The program gets the accented characters from a dBaseIV file, a MS
Access File and some YAML files.

I use Komodo Edit as my editor and it does handle UTF-8 correctly.

I know! That's why I did not understand why I got \202. I do not know
which charset ruby uses to convert the characters. I tried iconv and
jcode but ended up with the same results. At first I thought it was
because of the library I used (builder for example). The only
explanation I found was on that python tutorial.

Thanks.

Jamal

mortee wrote:
> Jamal Bengeloun wrote:
>> à gets translated to \205
>> ù gets translated to \227
>>
>> But why?
>
> I guess that your understanding is just wrong. I'm not really sure from
> where your program gets those accented chars that are translated to
> those specific escaped octal sequences. But if you're specifying them in
> string constants in your program, then it all depends on which encoding
> your editor uses to display them.
>
> For instance, I usually edit my scripts as UTF-8 text files, and I treat
> my string constants that way too. In that case, if I put an é in a string
> constant, it gets interpreted as \303\251, and not as \202. It's just
> the octal representation of the byte(s) your editor displays as a
> specific accented character.
>
> mortee


Jamal Bengeloun

10/30/2007 11:10:00 AM

Thanks a lot for your help. I thought I was going mad with this. I
thought it had something to do with ruby being C based (I saw something
on the internet about the difference between Python and JPython, where
the accented characters were encoded in UTF-8 and not html escaped).

What if the end rendering engine is not a browser (I checked and you're
absolutely right, it does work in a browser)? How to get true UTF-8
encoded characters instead of HTML escaped ones? I am using builder to
generate XML files from the data I get.

Thanks a lot for your explanation (it really did enlighten me) and your
help.

Jamal

Konrad Meyer wrote:
> Quoth Jamal Bengeloun:
>> characters into utf-8
>>
>> ...
>>
>> Does someone have an explanation?
>>
>> Does anyone know how to get those characters into the final xml files?
>>
>> Any help would be greatly appreciated.
>>
>> Jamal
>
> In short, you're asking what the difference between "\303\251", "é",
> and "‚" are.
>
> The first is an octal sequence embedded in a string (it happens to be the
> same as utf-8 'é'). The second is also utf-8 'é'. These two are the same
> string ("\303\251" == "é"). The last, '‚' is the html-escaped notation
> for a 'é' (I'm trusting your email for the correct number here). That is,
> literally "‚" != "é", but they should render the same to a browser
> capable of displaying utf-8.
>
> HTH,


Jimmy Kofler

10/30/2007 12:31:00 PM

> Jamal Bengeloun wrote:
> Thanks a lot for your help. I thought I will be going mad with this. I
> thought it had something to do with ruby being C based (I saw something
> on the internet about the difference between Python and JPython and the
> accented characters were encoded in UTF-8 and not html escaped).
>
> What if the end rendering engine is not a browser (I checked and you're
> absolutely right, it does work in a browser)? How to get true UTF-8
> encoded characters instead of HTML escaped ones? I am using builder to
> generate XML files from the data I get.
>
> Thanks a lot for your explanation (it really did enlighten me) and your
> help.
>
> Jamal

It should be possible to convert CP437 -
http://en.wikipedia.org/wiki/Cod... - to UTF-8 using iconv.

iconv -l | grep -i CP437 # => 437 CP437 IBM437 CSPC8CODEPAGE437
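
The same conversion can also be done from inside Ruby when a file is read. This is a sketch, not necessarily what the poster had in mind: it assumes Ruby 1.9 or later, where an open mode such as `"r:IBM437:UTF-8"` names the file's encoding and the encoding the strings should have in the program. A temporary file stands in for the real CP437 data.

```ruby
require "tempfile"

converted = Tempfile.create("cp437") do |file|
  # Write "été" as raw CP437 bytes to stand in for the real input file.
  file.binmode
  file.write("\x82t\x82".b)
  file.flush

  # External encoding IBM437, internal encoding UTF-8:
  # Ruby transcodes while reading.
  File.open(file.path, "r:IBM437:UTF-8", &:read)
end

puts converted
puts converted.encoding
```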

"How to get true UTF-8 encoded characters instead of HTML escaped ones?"

This should be doable with http://htmlentities.rub... .

(For a Ruby & UTF-8 snippet btw see
http://snippets.dzone.com/posts... ).
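
For plain decimal references of the `&#NNN;` kind, a hand-rolled replacement is enough to illustrate the idea. Note that this is not the API of the htmlentities library linked above: `decode_numeric_refs` is a hypothetical helper, it ignores named and hexadecimal entities, and it assumes Ruby 1.9 or later.

```ruby
# Replace decimal character references such as &#233; with the UTF-8
# character for that code point.
def decode_numeric_refs(text)
  text.gsub(/&#(\d+);/) { [Regexp.last_match(1).to_i].pack("U") }
end

puts decode_numeric_refs("caf&#233; cr&#232;me")
```

The number in such a reference is the Unicode code point, so it only gives the intended character if the input was decoded with the right code page in the first place.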

Cheers,

j. k.

Konrad Meyer

10/30/2007 2:13:00 PM

Quoth Jamal Bengeloun:
> Thanks a lot for your help. I thought I will be going mad with this. I
> thought it had something to do with ruby being C based (I saw something
> on the internet about the difference between Python and JPython and the
> accented characters were encoded in UTF-8 and not html escaped).
>
> What if the end rendering engine is not a browser (I checked and you're
> absolutely right, it does work in a browser)? How to get true UTF-8
> encoded characters instead of HTML escaped ones? I am using builder to
> generate XML files from the data I get.
>
> Thanks a lot for your explanation (it really did enlighten me) and your
> help.
>
> Jamal
>
> Konrad Meyer wrote:
> > Quoth Jamal Bengeloun:
> >> characters into utf-8
> >>
> >> ...
> >>
> >> Does someone have an explanation?
> >>
> >> Does anyone know how to get those characters into the final xml files?
> >>
> >> Any help would be greatly appreciated.
> >>
> >> Jamal
> >
> > In short, you're asking what the difference between "\303\251", "é",
> > and "‚" are.
> >
> > The first is an octal sequence embedded in a string (it happens to be the
> > same as utf-8 'é'). The second is also utf-8 'é'. These two are the same
> > string ("\303\251" == "é"). The last, '‚' is the html-escaped notation
> > for a 'é' (I'm trusting your email for the correct number here). That is,
> > literally "‚" != "é", but they should render the same to a browser
> > capable of displaying utf-8.
> >
> > HTH,

If I'm not mistaken, HTML and XML encoding is the same. So you're good for
those &#xxxxxx; chars.

HTH,
--
Konrad Meyer <konrad@tylerc.org> http://konrad.sobertil...

comp.lang.ruby
