Jesse P.
1/6/2008 1:29:00 PM
Thanks Matz :)
On Jan 6, 3:01 am, Yukihiro Matsumoto <m...@ruby-lang.org> wrote:
> Hi,
>
> In message "Re: REXML::Document could not parse UTF-8 "<name>\302</name>""
> on Sun, 6 Jan 2008 03:00:04 +0900, "Jesse P." <j.prab...@gmail.com> writes:
>
> |Thanks for your help. So I guess my problem is this:
> |1. I get an XML that is declared to be valid UTF-8, but
> |2. when I process some of the values, as you pointed out, some are not
> |valid UTF-8, and
> |3. causes a lot of problems when parsed by REXML.
> |
> |For a string of characters (e.g. some XML file), is there any way I can
> |detect just the non-UTF-8 characters and convert them to UTF-8?
>
> I guess you have to define what you want to do with this broken UTF-8
> data first. As long as you treat the data as UTF-8, it is impossible
> to treat it correctly. You can either
>
> * fix the data before reading it via REXML
> * parse data as Latin-1 or some other single byte encoding
> * replace the broken data with some valid UTF-8 sequence
>
> But YMMV.
>
> matz.
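
For anyone finding this thread later: the three options above can be sketched roughly as follows. This is only a minimal sketch, not something from matz's message. It assumes Ruby 2.1 or later (for String#scrub, which did not exist when this thread was written), and the helper names broken_utf8?, latin1_to_utf8 and scrub_utf8 are made up for illustration.

```ruby
# Hypothetical helpers; assumes Ruby 2.1+ (String#scrub).

# Detect whether the raw bytes are NOT valid UTF-8, so the data can be
# fixed before it is handed to the parser.
def broken_utf8?(bytes)
  !bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Treat every byte as a Latin-1 character and transcode to UTF-8.
# This never fails, but genuine multi-byte UTF-8 text would be mangled.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Replace each broken byte sequence with a valid UTF-8 replacement
# (U+FFFD by default) and keep everything else untouched.
def scrub_utf8(bytes, replacement = "\uFFFD")
  bytes.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

# The input from the subject line: a lone \xC2 (octal \302) byte.
raw = "<name>\xC2</name>".b

puts broken_utf8?(raw)               # is the input broken?
puts latin1_to_utf8(raw).inspect     # option 2: read as Latin-1
puts scrub_utf8(raw).inspect         # option 3: replace broken bytes
```

Either resulting string is valid UTF-8 and so should be safe to pass to REXML::Document.new. Which one is right depends on where the data came from: the Latin-1 route only makes sense if the stray bytes really are Latin-1, while scrubbing loses the original byte.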