comp.lang.ruby

Change/ignore XML encoding?

Travis Bell

8/22/2008 12:11:00 PM

Hey guys,

I think I am missing something very basic here. I have an XML request,
using the following code as an example:

require "rubygems"
require "xml/libxml"

movie = "sin+city"
search_url =
'http://www.movie-xml.com/interfaces/getmovie.php?movie...
url = search_url+movie
doc = XML::Document.file(url)

Now, with most of the XML results I get from movie-xml.com, the default
UTF-8 is fine since there are no non-UTF-8 characters. When searching
for Sin City, for example, there are. Here's the response I get:

Input is not proper UTF-8, indicate encoding !

The source XML has an encoding declared as such:

<?xml version="1.0" encoding="ISO-8859-1"?>

So I should probably just decode as ISO-8859-1 as well. How the hell do
I do that? I have Googled the crap out of this and just can't seem to
find what I need here...
--
Posted via http://www.ruby-....

7 Answers

matt

8/22/2008 4:33:00 PM

Travis Bell <travisbell@mac.com> wrote:

> Hey guys,
>
> I think I am missing something very basic here. I have an XML request,
> using the following code as an example:
>
> require "rubygems"
> require "xml/libxml"
>
> movie = "sin+city"
> search_url =
> 'http://www.movie-xml.com/interfaces/getmovie.php?movie...
> url = search_url+movie
> doc = XML::Document.file(url)
>
> Now, with most of the XML results I get from movie-xml.com, the default
> utf-8 is fine since there are no non-utf-8 characters. When searching
> Sin City as an example, there are. Here's the response I get:
>
> Input is not proper UTF-8, indicate encoding !
>
> The source XML has an encoding declared as such:
>
> <?xml version="1.0" encoding="ISO-8859-1"?>
>
> So I should probably just decode as ISO-8859-1 as well. How the hell do
> I do that? I have Googled the crap out of this and just can't seem to
> find what I need here...

Could this just be a bug in Libxml? REXML seems to do the right thing...
m.


--
matt neuburg, phd = matt@tidbits.com, http://www.tidbits...
Leopard - http://www.takecontrolbooks.com/leopard-custom...
AppleScript - http://www.amazon.com/gp/product/...
Read TidBITS! It's free and smart. http://www.t...

Eric I.

8/22/2008 8:48:00 PM

On Aug 22, 12:32 pm, m...@tidbits.com (matt neuburg) wrote:
> > Now, with most of the XML results I get from movie-xml.com, the default
> > utf-8 is fine since there are no non-utf-8 characters. When searching
> > Sin City as an example, there are. Here's the response I get:
>
> > Input is not proper UTF-8, indicate encoding !
>
> > The source XML has an encoding declared as such:
>
> > <?xml version="1.0" encoding="ISO-8859-1"?>
>
> > So I should probably just decode as ISO-8859-1 as well. How the hell do
> > I do that? I have Googled the crap out of this and just can't seem to
> > find what I need here...
>
> Could this just be a bug in Libxml? REXML seems to do the right thing...

Clearly libxml is expecting UTF-8, even though the XML file specifies
that it's encoded in ISO-8859-1. So that's a bug.

However, it appears that libxml is "correctly" rejecting data that is
not proper UTF-8 (independent of what it claims to be). Twice in the
XML data the word "verg?enza" appears, where the "?" stands for the byte
0xFC, which encodes a lowercase "u" with umlaut in ISO-8859-1. 0xFC
cannot appear in UTF-8 data per RFC 3629.

libxml should work with ISO-8859-1 data much of the time, as long as
it doesn't contain 13 specific bytes (0xC0, 0xC1, 0xF5..0xFF).
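
A quick way to check the byte-level point above, assuming Ruby 1.9 or
later (where strings carry an encoding). This is only a minimal sketch:
the constant, the two helper names, and the sample strings are made up
for illustration and are not taken from the movie-xml.com response. It
also suggests that avoiding the 13 bytes is necessary but not sufficient:
a byte such as 0xE9 (e with acute accent in ISO-8859-1) is a legal UTF-8
lead byte, but it is still malformed unless the right continuation bytes
follow it.

```ruby
# Bytes that can never appear in well-formed UTF-8 (RFC 3629):
# 0xC0, 0xC1 and 0xF5..0xFF.
FORBIDDEN_UTF8_BYTES = ([0xC0, 0xC1] + (0xF5..0xFF).to_a).freeze

# Hypothetical helper: which bytes of +data+ are on the forbidden list?
def forbidden_bytes_in(data)
  data.bytes.uniq & FORBIDDEN_UTF8_BYTES
end

# Hypothetical helper: would these raw bytes be well-formed if read as UTF-8?
def valid_utf8?(data)
  data.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

puts FORBIDDEN_UTF8_BYTES.size          # 13

# The word from the thread as ISO-8859-1 bytes: u-umlaut is the byte 0xFC.
word = "verg\xFCenza".b
puts valid_utf8?(word)                  # false
puts forbidden_bytes_in(word).map { |b| format("0x%02X", b) }.join(", ")  # 0xFC

# 0xE9 is not one of the 13 bytes, yet the data is still not valid UTF-8.
cafe = "caf\xE9".b
puts valid_utf8?(cafe)                  # false
puts forbidden_bytes_in(cafe).empty?    # true
```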

Eric

====

Are you interested in on-site Ruby or Ruby on Rails training
that uses well-designed, real-world, hands-on exercises?
http://Lea...

Travis Bell

8/22/2008 10:37:00 PM

Eric I. wrote:
> Clearly libxml is expecting UTF-8, even though the XML file specifies
> that it's encoded in ISO-8859-1. So that's a bug.
>
> libxml should work with ISO-8859-1 data much of the time, as long as
> it doesn't contain 13 specific bytes (0xC0, 0xC1, 0xF5..0xFF).

Heh, so is there a way around this aside from using REXML? Are we
concluding this is a bug in libxml?



matt

8/23/2008 4:36:00 PM

Travis Bell <travisbell@mac.com> wrote:

> Eric I. wrote:
> > Clearly libxml is expecting UTF-8, even though the XML file specifies
> > that it's encoded in ISO-8859-1. So that's a bug.
> >
> > libxml should work with ISO-8859-1 data much of the time, as long as
> > it doesn't contain 13 specific bytes (0xC0, 0xC1, 0xF5..0xFF).
>
> Heh, so is there a way around this aside from using REXML?

Well, if you really want to, I suppose you could read the declared
encoding yourself, convert the entire text to UTF-8, change the
encoding declaration to UTF-8, and then open the result with libxml.
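
The steps above could be sketched roughly like this, assuming Ruby 1.9
or later so that String#encode is available (on Ruby 1.8 the conversion
would need Iconv instead). The helper name xml_to_utf8 and the sample
document are hypothetical. Fetching the response body and passing the
result to libxml are left out, since the exact libxml-ruby call for
parsing from a string depends on the version installed.

```ruby
# Hypothetical helper: transcode an XML document to UTF-8 and rewrite its
# XML declaration to match, so a parser that insists on UTF-8 can read it.
def xml_to_utf8(raw, default_encoding = "UTF-8")
  raw = raw.b # binary copy, so matching never trips over non-UTF-8 bytes

  # Step 1: read the encoding named in the XML declaration, if there is one.
  declared = raw[/\A<\?xml[^>]*?encoding=["']([A-Za-z0-9._-]+)["']/, 1]
  declared ||= default_encoding

  # Step 2: convert the whole text from the declared encoding to UTF-8.
  utf8 = raw.encode(Encoding::UTF_8, declared)

  # Step 3: make the declaration agree with the new encoding.
  utf8.sub(/\A(<\?xml[^>]*?encoding=["'])[A-Za-z0-9._-]+(["'])/) do
    "#{$1}UTF-8#{$2}"
  end
end

# Made-up sample document; 0xFC is u-umlaut in ISO-8859-1.
latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><title>verg\xFCenza</title>".b
converted = xml_to_utf8(latin1)
puts converted.encoding         # UTF-8
puts converted.valid_encoding?  # true
puts converted
```

The converted string should then be acceptable to a parser that expects
UTF-8. A document without an encoding declaration is treated as UTF-8
here, which matches the XML default but is still an assumption about the
input.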

> Are we
> concluding this is a bug in libxml?

Not sure. Couldn't hurt to report it, though. It has its own Google
group and its own bug reporting page... m.


Travis Bell

8/23/2008 11:48:00 PM

matt neuburg wrote:
> Well, if you really want to, I suppose you could parse the encoding info
> yourself, convert the encoding of the entire text and change the
> encoding info to utf8, and then open with libxml.
>
>> Are we
>> concluding this is a bug in libxml?
>
> Not sure. Couldn't hurt to report it, though. It has its own google
> group and its own bug reporting page... m.

Right on. For now I just switched to REXML, and without any special
changes everything parses properly. Good for anyone else to know for
future reference.



matt

8/24/2008 5:02:00 PM

Travis Bell <travisbell@mac.com> wrote:

> matt neuburg wrote:
> > Well, if you really want to, I suppose you could parse the encoding info
> > yourself, convert the encoding of the entire text and change the
> > encoding info to utf8, and then open with libxml.
> >
> >> Are we
> >> concluding this is a bug in libxml?
> >
> > Not sure. Couldn't hurt to report it, though. It has its own google
> > group and its own bug reporting page... m.
>
> Right on. For now I just switched to rexml and without any special
> change everything parses properly. Good for anyone else to know for
> future reference.

Okay, but that helps no one since you didn't submit the bug. So I
submitted it for you. m.


Erik Hollensbe

8/24/2008 8:22:00 PM

matt neuburg wrote:
> Travis Bell <travisbell@mac.com> wrote:
>
>>
>> Right on. For now I just switched to rexml and without any special
>> change everything parses properly. Good for anyone else to know for
>> future reference.
>
> Okay, but that helps no one since you didn't submit the bug. So I
> submitted it for you. m.

http://rubyforge.org/tracker/?func=detail&atid=1971&aid=21658&gr...

Here's the link so the concerned can follow its status.

-Erik