comp.lang.ruby

Re: REXML::Document could not parse UTF-8 "\302"

Yukihiro Matsumoto

1/5/2008 7:02:00 PM

Hi,

In message "Re: REXML::Document could not parse UTF-8 "<name>\302</name>""
on Sun, 6 Jan 2008 03:00:04 +0900, "Jesse P." <j.prabawa@gmail.com> writes:

|Thanks for your help. So I guess my problem is this:
|1. I get an XML document that is declared to be valid UTF-8, but
|2. when I process some of the values, as you pointed out, some are not
|valid UTF-8, and
|3. this causes a lot of problems when parsed by REXML.
|
|For a string of characters (e.g. some XML file), is there any way I can
|detect just the non-UTF-8 characters and convert them to UTF-8?

I guess you first have to define what you want to do with this broken
UTF-8 data. As long as you treat the data as UTF-8, it is impossible
to process it correctly. You can do one of the following:

* fix the data before reading it via REXML
* parse data as Latin-1 or some other single byte encoding
* replace the broken data with some valid UTF-8 sequence

But YMMV.

matz.
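
The options listed above can be sketched in Ruby roughly as follows. This is only an illustration, not something from the original post: it assumes a Ruby new enough to have String#valid_encoding? and String#scrub (2.1 or later), the helper names are hypothetical, and it guesses that the stray bytes were meant as Latin-1, which may not be true for your data.

```ruby
# Hypothetical helpers; the names are illustrative, not part of REXML or Ruby.

# Detect: true if the raw bytes form a valid UTF-8 sequence.
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Parse as a single-byte encoding: treat every byte as Latin-1
# (an assumption about the source) and transcode to UTF-8.
def reinterpret_as_latin1(bytes)
  bytes.encode(Encoding::UTF_8, Encoding::ISO_8859_1)
end

# Replace the broken data: swap each invalid byte sequence for a marker.
def replace_broken(bytes, marker = "?")
  bytes.dup.force_encoding(Encoding::UTF_8).scrub(marker)
end

# Fix the data: keep valid UTF-8 as is and convert only the invalid
# bytes, again assuming those bytes were meant as Latin-1.
def repair_invalid_bytes(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).scrub do |bad|
    bad.encode(Encoding::UTF_8, Encoding::ISO_8859_1)
  end
end

# "\xC2" is the byte written as "\302" in the subject line: a UTF-8
# lead byte with no continuation byte after it.
broken = "<name>\xC2</name>".b

puts valid_utf8?(broken)            # validity check on the raw bytes
puts reinterpret_as_latin1(broken)  # every byte read as Latin-1
puts replace_broken(broken)         # invalid byte replaced by "?"
puts repair_invalid_bytes(broken)   # only the invalid byte converted
```

Whichever result you choose is valid UTF-8 and could then be handed to REXML::Document.new. Note that reinterpreting the whole input as Latin-1 will mangle any multibyte characters that were already correct UTF-8, so the selective repair is usually the safer choice when the data is mostly valid.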

Jesse P.

1/6/2008 1:29:00 PM

Thanks Matz :)
