Asp Forum - replace delimiter in Unicode encoded file

ciapecki

12/4/2006 1:38:00 PM

Is there a way in Ruby to:
- open a file encoded in ucs-2le,
- replace every occurrence of '\t' (X'0009') with ',' (X'002c'),
- and save it back in ucs-2le, without losing any content?

thanks
chris

19 Answers

Ross Bamford

12/5/2006 1:49:00 AM

On Mon, 2006-12-04 at 22:40 +0900, ciapecki wrote:
> Is there a way in ruby to:
> - open a file encoded in ucs-2le,
> - replace every occurrence of '\t' (X'0009') with ',' (X'002c'),
> - and save it back in ucs-2le, without losing any content?

Well, you _could_ do it with iconv:

$ irb -riconv

data = File.read('test')
# => "a\000b\000c\000\t\000\273\006\t\0001\000"

str = Iconv.iconv('utf-8', 'ucs-2le', data).first
# => "abc\t\332\273\t1"

newstr = str.tr("\t", ',')
# => "abc,\332\273,1"

newdata = Iconv.iconv('ucs-2le', 'utf-8', newstr).first
# => "a\000b\000c\000,\000\273\006,\0001\000"

But that strikes me as unnecessary when you could just do:

newdata = File.read('test').tr("\t", ',')
# => "a\000b\000c\000,\000\273\006,\0001\000"

;)

Hope that helps,
--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

David Vallner

12/6/2006 6:11:00 AM

Ross Bamford wrote:
> On Mon, 2006-12-04 at 22:40 +0900, ciapecki wrote:
>> Is there a way in ruby to:
>> - open a file encoded in ucs-2le,
>> - replace every occurrence of '\t' (X'0009') with ',' (X'002c'),
>> - and save it back in ucs-2le, without losing any content?
> But that strikes me as unnecessary when you could just do:
>
> newdata = File.read('test').tr("\t", ',')
> # => "a\000b\000c\000,\000\273\006,\0001\000"
>

Um. Other way around. *Old* data is in UCS-2LE, not in UTF-8, so it's
not ASCII-transparent. Your iconv approach could work if you swapped
around the encoding names, except you'd probably also have to involve a
$KCODE = 'u' and require 'jcode' to avoid clobbering the possible cases
where in UTF8, 0x09 and 0x2c are part of a multibyte sequence.

David Vallner
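
A small sketch of the "not ASCII-transparent" point, written for current Ruby (1.9 or later, where strings carry an encoding) rather than the Ruby of 2006. The sample text and variable names are made up for illustration. It shows that a byte-wise tr on UTF-16LE data also rewrites any 0x09 byte that belongs to a different character, here U+0109, whose little-endian bytes are 0x09 0x01:

```ruby
# Sample text (an assumption, not from the thread): a real tab plus U+0109.
text = "a\tb\u0109c"
raw  = text.encode(Encoding::UTF_16LE).b   # raw bytes, tagged as binary

# Byte-level replacement: every 0x09 byte becomes 0x2C, including the
# low byte of U+0109, which turns that character into U+012C.
bytewise = raw.tr("\t", ",")
              .force_encoding(Encoding::UTF_16LE)
              .encode(Encoding::UTF_8)

# Character-level replacement: decode first, then replace whole characters.
charwise = raw.dup
              .force_encoding(Encoding::UTF_16LE)
              .encode(Encoding::UTF_8)
              .tr("\t", ",")

puts bytewise.inspect
puts charwise.inspect
```

For files that only contain characters below U+0100 plus a few Latin Extended-A letters, the byte-wise shortcut may happen to work, which is probably why it looked fine in the first test.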

ciapecki

12/6/2006 12:02:00 PM

David Vallner wrote:

> Ross Bamford wrote:
> > On Mon, 2006-12-04 at 22:40 +0900, ciapecki wrote:
> >> Is there a way in ruby to:
> >> - open a file encoded in ucs-2le,
> >> - replace every occurrence of '\t' (X'0009') with ',' (X'002c'),
> >> - and save it back in ucs-2le, without losing any content?
> > But that strikes me as unnecessary when you could just do:
> >
> > newdata = File.read('test').tr("\t", ',')
> > # => "a\000b\000c\000,\000\273\006,\0001\000"
> >
>
> Um. Other way around. *Old* data is in UCS-2LE, not in UTF-8, so it's
> not ASCII-transparent. Your iconv approach could work if you swapped
> around the encoding names, except you'd probably also have to involve a
> $KCODE = 'u' and require 'jcode' to avoid clobbering the possible cases
> where in UTF8, 0x09 and 0x2c are part of a multibyte sequence.
>
> David Vallner

Thanks Ross for the try, but it is not working,
tried for:

"\377\376B\001\363\000|\001k\000o\000\t\000k\000s\000i\000\005\001|\001k\000a\000\t\000c\000z\000B\001o\000w\000i\000e\000k\000\r\000\n\000B\001\005\001k\000a\000\t\000\t\000|\001d\000z\001b\000B\001o\000\r\000\n\000"
which is:

łóżko książka człowiek
łąka żdźbło

-> (the same :))

the conversion should be:
łóżko,książka,człowiek
łąka,,żdźbło

but with the Iconv try:
łóżko,książka,człowiek
???????????????

after swapping utf-8 and ucs-2le in both iconv conversions, I get an
error message:
`iconv': "\377\376B\001¾ |?k\000o\000\t\000k\000"...
(Iconv::IllegalSequence)

Any other suggestions highly appreciated.

Thanks
chris

Ross Bamford

12/6/2006 12:49:00 PM

On Wed, 06 Dec 2006 12:01:37 -0000, ciapecki <ciapecki@gmail.com> wrote:
> David Vallner wrote:
>> Ross Bamford wrote:
>> > On Mon, 2006-12-04 at 22:40 +0900, ciapecki wrote:
>> >> Is there a way in ruby to:
>> >> - open a file encoded in ucs-2le,
>> >> - replace every occurrence of '\t' (X'0009') with ',' (X'002c'),
>> >> - and save it back in ucs-2le, without losing any content?
>> > But that strikes me as unnecessary when you could just do:
>> >
>> > newdata = File.read('test').tr("\t", ',')
>> > # => "a\000b\000c\000,\000\273\006,\0001\000"
>> >
>>
>> Um. Other way around. *Old* data is in UCS-2LE, not in UTF-8, so it's
>> not ASCII-transparent. Your iconv approach could work if you swapped
>> around the encoding names, except you'd probably also have to involve a
>> $KCODE = 'u' and require 'jcode' to avoid clobbering the possible cases
>> where in UTF8, 0x09 and 0x2c are part of a multibyte sequence.
>>
>
> Thanks Ross for the try, but it is not working,
> tried for:
>
> "\377\376B\001\363\000|\001k\000o\000\t\000k\000s\000i\000\005\001|\001k\000a\000\t\000c\000z\000B\001o\000w\000i\000e\000k\000\r\000\n\000B\001\005\001k\000a\000\t\000\t\000|\001d\000z\001b\000B\001o\000\r\000\n\000"
> which is:
>
> łóżko książka człowiek
> łąka żdźbło
>
> -> (the same :))
>
> the conversion should be:
> łóżko,książka,człowiek
> łąka,,żdźbło
>
> but with the Iconv try:
> łóżko,książka,człowiek
> à¨ä??Ôæ¬æ??â°?â°?ç°?æç¨?æ?ä??æ¼à´?à´?
>
> after swapping utf-8 and ucs-2le in both iconv conversions, I get an
> error message:
> `iconv': "\377\376B\001Â¾ |â?ºk\000o\000\t\000k\000"...
> (Iconv::IllegalSequence)
>
>
> Any other suggestions highly appreciated.
>

I think David is confusing the order of the 'from' and 'to' arguments to
Iconv.iconv - they go: (to, from, data). My short example was
ill-conceived, though - this might be safer:

$ irb -riconv

s = <the string you show above>

s.gsub(/\t\000(?!\000)/, ",\000")
# =>
"\377\376B\001\363\000|\001k\000o\000,\000k\000s\000i\000\005\001|\001k\000a\000,\000c\000z\000B\001o\000w\000i\000e\000k\000\r\000\n\000B\001\005\001k\000a\000,\000,\000|\001d\000z\001b\000B\001o\000\r\000\n\000"

(This is:

łóżko,książka,człowiek
łąka,,żdźbło
)

But I'm not totally sure, so you might be better with iconv anyway:

Iconv.iconv('ucs-2le', 'utf-8', Iconv.iconv('utf-8','ucs-2le',
s).first.gsub(/\t/u, ',')).first
# =>
"\377\376B\001\363\000|\001k\000o\000,\000k\000s\000i\000\005\001|\001k\000a\000,\000c\000z\000B\001o\000w\000i\000e\000k\000\r\000\n\000B\001\005\001k\000a\000,\000,\000|\001d\000z\001b\000B\001o\000\r\000\n\000"

(This too is:

łóżko,książka,człowiek
łąka,,żdźbło
)

Unless I missed something, this seems to work fine here. Does it work for
you?

--
Ross Bamford - rosco@roscopeco.remove.co.uk
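
For readers on a newer Ruby: as far as I know, Iconv was dropped from the standard library in Ruby 2.0, and String#encode covers the same round trip. The following is only a sketch of the decode, replace, re-encode idea from the post above. The helper name tabs_to_commas_utf16le is made up, and the input is just the first few characters of the sample string from the thread:

```ruby
# Hypothetical helper: takes raw UTF-16LE bytes and returns raw UTF-16LE
# bytes with every tab character replaced by a comma.
def tabs_to_commas_utf16le(raw)
  # Tag a copy of the bytes as UTF-16LE, then transcode to UTF-8 so that
  # tr works on characters instead of bytes.
  utf8 = raw.dup.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
  utf8.tr("\t", ",").encode(Encoding::UTF_16LE).b
end

# Start of the sample from the thread: BOM, five characters, a tab, then "k".
s = "\xFF\xFEB\x01\xF3\x00|\x01k\x00o\x00\t\x00k\x00".b
result = tabs_to_commas_utf16le(s)
puts result.inspect
```

The BOM survives the round trip here because it is decoded as the ordinary character U+FEFF and encoded back unchanged.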

ciapecki

12/6/2006 5:56:00 PM

Thanks Ross,

I was stupid enough to forget to open the output file in binary mode
("wb"); before, I had only "w".

Thanks again for your help
chris
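
Why the missing "b" matters, as far as I can tell: in text mode on Windows every 0x0A byte written becomes 0x0D 0x0A, so the first line break inserts one extra byte and every 16-bit unit after it is misaligned. That would match the symptom of a correct first line followed by garbage. Below is a sketch of the whole read, replace, write cycle in current Ruby using File.binread and File.binwrite. The helper name convert_tabs, the file names and the sample data are assumptions for illustration:

```ruby
require "tmpdir"

# Hypothetical helper: read UTF-16LE bytes, swap tabs for commas,
# and write the bytes back without any newline translation.
def convert_tabs(src_path, dst_path)
  raw  = File.binread(src_path)   # binary read, bytes arrive untouched
  text = raw.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
  out  = text.tr("\t", ",").encode(Encoding::UTF_16LE)
  File.binwrite(dst_path, out.b)  # binary write, same effect as mode "wb"
end

# Made-up sample: BOM, two CRLF-terminated lines, tab delimiters.
sample  = "\uFEFFa\tb\r\nc\t\td\r\n".encode(Encoding::UTF_16LE).b
written = nil

Dir.mktmpdir do |dir|
  src = File.join(dir, "in.txt")
  dst = File.join(dir, "out.txt")
  File.binwrite(src, sample)
  convert_tabs(src, dst)
  written = File.binread(dst)
end

puts written.inspect
```

Since a tab and a comma both take two bytes in UTF-16LE, the output file should have exactly the same size as the input, which makes a quick sanity check.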

Paul Lutus

12/6/2006 6:40:00 PM

ciapecki wrote:

/ ...

> I was that stupid and forgot to open the writable file as binary "wb"
> (before I had "w" only)

Don't kick yourself too hard, the error lies with Microsoft trying to golf
its way out of a thicket of its own making. There never should have been
two standard line endings (actually three if you include the Mac), and
there never should have been two path delimiters either, both of which
cause endless headaches for cross-platform coders.

The reason these variations exist is so someone can say, "my software is
different, unique, patentable, now you have to pay me for it." Even if the
differences convey no benefit to the users.

--
Paul Lutus
http://www.ara...

ciapecki

12/6/2006 7:07:00 PM

Another question following up.
Is there a way to find out which encoding a file uses (is it
ucs-2le or utf-8)?
When I open a file in Vim I can check it with :set fileencoding,
so there must be some way to recognize the file and its encoding.

Thanks
chris
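
One possible approach to this question, sketched in current Ruby: look for a byte order mark at the start of the file, and fall back to checking whether the bytes are valid UTF-8. This is a heuristic, not a reliable detector. It says nothing about files without a BOM that are not UTF-8, and a UTF-32LE BOM also starts with FF FE. The names BOMS and guess_encoding are made up for this example:

```ruby
# Byte order marks for the encodings discussed in this thread.
BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFF\xFE".b     => Encoding::UTF_16LE,
  "\xFE\xFF".b     => Encoding::UTF_16BE
}.freeze

# Hypothetical helper: returns an Encoding, or nil when it cannot tell.
def guess_encoding(raw)
  bytes = raw.b
  BOMS.each { |bom, enc| return enc if bytes.start_with?(bom) }

  # No BOM: accept UTF-8 only if the bytes form valid UTF-8.
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? Encoding::UTF_8 : nil
end

puts guess_encoding("\xFF\xFEa\x00".b).inspect
puts guess_encoding("plain ascii").inspect
puts guess_encoding("\xFF\x00".b).inspect
```

For a file on disk, something like guess_encoding(File.binread(path)) would apply the same check to its contents.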

David Kastrup

12/6/2006 7:59:00 PM

Paul Lutus <nospam@nosite.zzz> writes:

> ciapecki wrote:
>
> / ...
>
>> I was that stupid and forgot to open the writable file as binary "wb"
>> (before I had "w" only)
>
> Don't kick yourself too hard, the error lies with Microsoft trying
> to golf its way out of a thicket of its own making. There never
> should have been two standard line endings (actually three if you
> include the Mac), and there never should have been two path
> delimiters either, both of which cause endless headaches for
> cross-platform coders.
>
> The reason these variations exist is so someone can say, "my
> software is different, unique, patentable, now you have to pay me
> for it." Even if the differences convey no benefit to the users.

No, the reason is that CP/M had no tty concept, and consequently no
automatic LF->CRLF translation (and CRLF is required on printers).
Also forward slashes were used in CP/M as option lead-ins (CP/M, not
having named directories, did not need to use forward slashes for
those).

This legacy is from long before POSIX, in fact, from long before C.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum

David Vallner

12/6/2006 9:23:00 PM

Ross Bamford wrote:
> I think David is confusing the order of the 'from' and 'to' arguments to
> Iconv.iconv - they go: (to, from, data).

/me puts on dunce hat.

Sorry! I recall always using the command-line iconv specifying them in
from,to order, and apparently that burned deeper into my brain pathways
than it should have.

David Vallner

David Vallner

12/6/2006 9:36:00 PM

David Kastrup wrote:
> Paul Lutus <nospam@nosite.zzz> writes:
>
>> ciapecki wrote:
>>
>> / ...
>>
>>> I was that stupid and forgot to open the writable file as binary "wb"
>>> (before I had "w" only)
>> Don't kick yourself too hard, the error lies with Microsoft trying
>> to golf its way out of a thicket of its own making. There never
>> should have been two standard line endings (actually three if you
>> include the Mac), and there never should have been two path
>> delimiters either, both of which cause endless headaches for
>> cross-platform coders.
>>
>> The reason these variations exist is so someone can say, "my
>> software is different, unique, patentable, now you have to pay me
>> for it." Even if the differences convey no benefit to the users.
>
> No, the reason is that CP/M had no tty concept, and consequently no
> automatic LF->CRLF translation (and CRLF is required on printers).
> Also forward slashes were used in CP/M as option lead-ins (CP/M, not
> having named directories, did not need to use forward slashes for
> those).
>
> This legacy is from long before POSIX, in fact, from long before C.
>

Hrm, and I also recall once knowing about why the different text /
binary file handling was around. Something to do with some DOS
programming environment and efficient (by a measure that could only have
been important enough to warrant a design wart on the hardware from
then) line-oriented text processing.

I don't think there's any distinction between the file modes on the OS
level anymore, but programming language runtimes interpret the absence
of the 'b' flag as "translate newlines" to only have to internally
support one convention and avoid having to have every text manipulation
routine handle the difference gracefully.

The blurb about preserving the idiosyncrasies as a business strategy is
hilarious. Also patent nonsense and FUD ;)

David Vallner

comp.lang.ruby
