Asp Forum - Re: Premature end of regular expression with non-ascii chara

Nuralanur

1/30/2006 11:25:00 PM

When I read in a text with accents from a file under cygwin, these get
converted to something like '\352'.
You can then search for these using regexps:

a="un texte extrêmement énervant"
p splitted_text=a.split(/(?=)/)
b=/extr\352mement/
d=a.match(b)
p d[0]   # => "extr\352mement"

When I write the result to a file, it appears correctly as "extrêmement".

f=File.new("t.txt",'w')
f.puts d[0]
f.close
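
For readers wondering where '\352' comes from: it appears to be Ruby's octal escape for the single byte 0xEA (decimal 234), which is how ISO-8859-1 encodes ê. The sketch below is only an illustration and assumes Ruby 2.0 or later (newer than this thread) for String#b and String#encode.

```ruby
# The string as it would arrive from a Latin-1 file: ê is the single byte \352.
raw = "extr\352mement".b            # .b returns a binary (ASCII-8BIT) copy

byte  = raw.bytes[4]                # the byte that p prints as \352
octal = format("%o", byte)          # same byte written in octal
hex   = format("%X", byte)          # same byte written in hexadecimal

# Tell Ruby the bytes are ISO-8859-1, then transcode them to UTF-8.
utf8 = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# In UTF-8 the same letter takes two bytes, hence two octal escapes.
utf8_octal = "\u00EA".bytes.map { |b| format("%o", b) }

puts "byte #{byte} = octal #{octal} = hex #{hex}"
puts utf8
p utf8_octal
```

So a search for "the octal code of ê" only turns up 352 when the lookup table is for ISO-8859-1 (or CP1252); a UTF-8 table lists the two-byte form instead.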

Hope that helps,

Best regards,

Axel

6 Answers

Nick Snels

1/31/2006 11:08:00 AM

Hi Axel,

thanks for the reply. If I try your code, my characters with accents
don't get translated to numbers, unfortunately. Do you know where these
numbers come from? I looked on the net, but \352 is not the octal,
hexadecimal or UTF-8 representation of ê. Could you split the following
sentence for me and let me know what the result is:

a="Ils sont très énervé les regexps."
splitted_text=a.split(/\s/)

Not my best French. But if I try this, 'très énervé les' is still one
part, even though I split it on the spaces. Maybe it is different for
you, and then I have to look deeper. Thanks for your help. If anybody is
able to split it like 'très', 'énervé', 'les', please let me know!

Kind regards,

Nick

--
Posted via http://www.ruby-....

Lugovoi Nikolai

1/31/2006 11:33:00 AM

Odds are your text is not in UTF-8 but in CP1252 or a similar encoding.
In that case split indeed won't work right when $KCODE = 'u'.
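
A plausible reading of the symptom, offered as an assumption rather than a confirmed diagnosis: with $KCODE = 'u' the regexp engine takes any byte in the range 0xE0-0xEF as the start of a three-byte UTF-8 character, so the Latin-1 bytes for è (0xE8) and é (0xE9) each swallow the two bytes that follow, sometimes including a space. The sketch below imitates that rule with hypothetical helpers (announced_length, naive_utf8_chunks); it does not call the Ruby 1.8 engine and assumes Ruby 2.0 or later.

```ruby
# Hypothetical helper: the sequence length a UTF-8 lead byte announces.
def announced_length(byte)
  if    byte >= 0xF0 then 4
  elsif byte >= 0xE0 then 3
  elsif byte >= 0xC0 then 2
  else  1
  end
end

# Hypothetical helper: cut a byte string into "characters" by lead byte only,
# without checking that the following bytes are valid continuation bytes.
def naive_utf8_chunks(str)
  bytes = str.bytes
  chunks = []
  i = 0
  while i < bytes.length
    n = announced_length(bytes[i])
    chunks << bytes[i, n].pack("C*")
    i += n
  end
  chunks
end

# The sentence from the thread, as raw ISO-8859-1 bytes.
latin1 = "Ils sont tr\xE8s \xE9nerv\xE9 les regexps.".b

# Split only where a chunk is exactly one space, as a per-character matcher would.
parts = naive_utf8_chunks(latin1)
          .slice_when { |a, b| a == " " || b == " " }
          .map(&:join)
          .reject { |part| part == " " }

# The same bytes are not valid UTF-8 at all.
as_utf8_valid = latin1.dup.force_encoding("UTF-8").valid_encoding?

p parts.length
p as_utf8_valid
```

Under this model the spaces after "très" and "énervé" are consumed as trailing bytes of è and é, which would leave 'très énervé les' as a single part, matching the report above.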

2006/1/31, Nick Snels <nick.snels@gmail.com>:
> Hi Axel,
>
> thanks for the reply. If I try your code, my characters with accents
> don't get translated to numbers, unfortunately. Do you know where these
> numbers come from? I looked on the net, but \352 is not the octal,
> hexadecimal or UTF-8 representation of ê. Could you split the following
> sentence for me and let me know what the result is:
>
> a="Ils sont très énervé les regexps."
> splitted_text=a.split(/\s/)
>
> Not my best French. But if I try this, 'très énervé les' is still one
> part, even though I split it on the spaces. Maybe it is different for
> you, and then I have to look deeper. Thanks for your help. If anybody is
> able to split it like 'très', 'énervé', 'les', please let me know!
>
> Kind regards,
>
> Nick
>
> --
> Posted via http://www.ruby-....
>
>

Nick Snels

1/31/2006 12:12:00 PM

Indeed, it isn't in UTF-8. It's in ISO-8859-1 (Latin1). The problem here
is that I would like to work in UTF-8, but I have to read in files. And
these files are often (almost always) in ISO-8859-1. And I haven't found
a way of converting these strings to Unicode in Ruby. é and è etc. form
part of ISO-8859-1.

Anyway, I removed $KCODE altogether from config/environment.rb and now it
works. And Axel, I now get the numbers too. In config/environment.rb I
had added:

$KCODE = 'u'
require 'jcode'

to get Gettext to work. So it turns out that if you aren't working fully
in UTF-8, you have to be careful about adding this.

Thanks for pointing me to $KCODE, twice!

Kind regards,

Nick

--
Posted via http://www.ruby-....

Lugovoi Nikolai

1/31/2006 12:22:00 PM

2006/1/31, Nick Snels <nick.snels@gmail.com>:
> Indeed, it isn't in UTF-8. It's in ISO-8859-1 (Latin1). The problem here
> is that I would like to work in UTF-8, but I have to read in files. And
> these files are often (almost always) in ISO-8859-1. And I haven't found
> a way of converting these strings to Unicode in Ruby. é and è etc. form
> part of ISO-8859-1.
>

Use the Iconv library.
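
Iconv shipped with Ruby when this was written but was removed from the standard library in Ruby 2.0. On a current Ruby the same conversion can be sketched with the built-in String#encode, or by naming both encodings when reading the file. The helper names below (latin1_to_utf8, read_latin1_as_utf8) are hypothetical, and the temporary file stands in for the real input files.

```ruby
require "tempfile"

# On Ruby 1.8 the equivalent call would have been roughly:
#   require 'iconv'
#   Iconv.conv('UTF-8', 'ISO-8859-1', str)

# Hypothetical helper: relabel raw bytes as ISO-8859-1, then transcode to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Hypothetical helper: let File.read transcode while reading
# ("external:internal" encoding pair).
def read_latin1_as_utf8(path)
  File.read(path, encoding: "ISO-8859-1:UTF-8")
end

raw = "Ils sont tr\xE8s \xE9nerv\xE9 les regexps.".b
converted = latin1_to_utf8(raw)

words = nil
Tempfile.create("latin1") do |f|
  f.binmode
  f.write(raw)            # write the Latin-1 bytes unchanged
  f.flush
  words = read_latin1_as_utf8(f.path).split(/\s/)
end

p converted.encoding
p words
```

Once the text is genuinely UTF-8, splitting on whitespace separates 'très', 'énervé' and 'les' as expected.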

Lars Broecker

1/31/2006 12:36:00 PM

Nick Snels wrote:
> Indeed, it isn't in UTF-8. It's in ISO-8859-1 (Latin1). The problem here
> is that I would like to work in UTF-8, but I have to read in files. And
> these files are often (almost always) in ISO-8859-1. And I haven't found
> a way of converting these strings to Unicode in Ruby. é and è etc. form
> part of ISO-8859-1.

I have to deal with similar problems when processing the infamous German
umlauts äöü. My solution has been to convert a string from latin1 or
latin15 to utf8 via this:
utf8_string=latin1_string.unpack("C*").pack("U*")

and the other way round with
latin1_string=utf8_string.unpack("U*").pack("C*")

This has worked so far and does not require any changes to the environment.
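
A small check of the pack/unpack round trip, as a sketch assuming Ruby 2.0 or later. It also illustrates a caveat: the trick relies on ISO-8859-1 byte values being identical to the first 256 Unicode code points, so it is exact for Latin-1 but not for ISO-8859-15, which reassigns a few bytes such as 0xA4 (the euro sign).

```ruby
latin1_string = "\xE4\xF6\xFC".b                 # äöü as ISO-8859-1 bytes

# Each byte value is reused as a Unicode code point and written out as UTF-8.
utf8_string = latin1_string.unpack("C*").pack("U*")

# And back: each code point (all below 256 here) becomes a single byte again.
round_trip = utf8_string.unpack("U*").pack("C*")

# For Latin-1 input this agrees with Ruby's own transcoding.
same_as_encode =
  utf8_string == latin1_string.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Caveat: in ISO-8859-15 the byte 0xA4 means the euro sign, not U+00A4.
euro_byte  = "\xA4".b
via_pack   = euro_byte.unpack("C*").pack("U*")
via_encode = euro_byte.dup.force_encoding("ISO-8859-15").encode("UTF-8")

puts utf8_string
p round_trip == latin1_string
p same_as_encode
p via_pack == via_encode
```

For the umlauts the two approaches give the same result; for the bytes where ISO-8859-15 differs from ISO-8859-1, only an encoding-aware conversion yields the intended character.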
HTH,
Lars

Nick Snels

1/31/2006 1:19:00 PM

Hi Nikolai,

thanks for the suggestion, I will definitely give Iconv a try. I hope it
doesn't slow things down a lot.

Kind regards,

Nick

--
Posted via http://www.ruby-....

comp.lang.ruby
