comp.lang.ruby

REXML: parsing a string with unescaped ampersand entities

Frank Reiff

12/7/2007 6:12:00 PM

Hi,

REXML seems to SOMETIMES choke on parsing ampersands within entities,
e.g.

require 'rexml/document'
include REXML

string = '<?xml version="1.0" encoding="UTF-8"?><hello>hello&world</hello>'
doc = Document.new(string)
puts "#{doc}"

works fine (output below):
<?xml version='1.0' encoding='UTF-8'?><hello>hello&world</hello>

BUT:

string = '<?xml version="1.0" encoding="UTF-8"?><hello>hello& world</hello>'
doc = Document.new(string)
puts "#{doc}"

crashes out with:

REXML::ParseException: #<RuntimeError: Illegal character '&' in raw string "hello& world">
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/text.rb:91:in `initialize'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:43:in `new'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:43:in `parse'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:190:in `build'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:45:in `initialize'
/Users/frankreiff/Live Developments/ruby/analyze/xml_parser.rb:102:in `new'
/Users/frankreiff/Live Developments/ruby/analyze/xml_parser.rb:102
...
Illegal character '&' in raw string "hello& world"
Line:
Position:
Last 80 unconsumed characters: </hello>

The only difference is the space after the &.

What is going on, and how can I fix this?

Best regards,

Frank
--
Posted via http://www.ruby-....

3 Answers

Bob Hutchison

12/7/2007 7:18:00 PM

Hi,

On 7-Dec-07, at 1:12 PM, Frank Reiff wrote:

> Hi,
>
> REXML seems to SOMETIMES choke on parsing ampersands within entities,
> e.g.
>
> string = '<?xml version="1.0" encoding="UTF-8"?><hello>hello&world</hello>'
> doc = Document.new(string)
> puts "#{doc}"
>
> works fine (output below):
> <?xml version='1.0' encoding='UTF-8'?><hello>hello&world</hello>
>
> BUT:
>
> string = '<?xml version="1.0" encoding="UTF-8"?><hello>hello& world</hello>'
> doc = Document.new(string)
> puts "#{doc}"
>

[snip]

>
> What is going on? and how can I fix this?

Neither is legal XML; both should fail. A bare & in element content must
be written as &amp;. You can either escape the content or use a CDATA
block.
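
As a minimal sketch of those two options (the sample strings are made up for illustration, and this assumes REXML is available as `rexml/document`), both of the documents below are well-formed and should give back the same text:

```ruby
require 'rexml/document'

# Option 1: escape the ampersand as an entity reference.
escaped = '<hello>hello&amp; world</hello>'

# Option 2: wrap the raw text in a CDATA section, where & needs no escaping.
cdata = '<hello><![CDATA[hello& world]]></hello>'

# Parse both variants and collect the text of the root element.
texts = [escaped, cdata].map do |xml|
  doc = REXML::Document.new(xml)
  doc.root.text
end

texts.each { |t| puts t }
```

Escaping is usually the simpler choice when you generate the XML yourself; CDATA can be handy when the content contains many special characters.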

Cheers,
Bob

>
>
> Best regards,
>
> Frank
> --
> Posted via http://www.ruby-....
>

----
Bob Hutchison -- tumblelog at http://www.recurs...
Recursive Design Inc. -- weblog at http://www.recursiv...
http://www.rec... -- works on http://www.raconteur.info/cms-for-static-con...



Frank Reiff

12/7/2007 7:40:00 PM

> Neither is legal XML, both should fail. You can either escape the
> content or use a CDATA block.

You're of course right. Both are illegal.

Somebody suggested to me that the original problem might be caused by
incorrectly encoded entities (&amp; &quot;), and reading through the W3C
spec (always a bad idea) confused me to the extent of believing that
you only had to encode character entities in attribute values, which
isn't the case. It can't in fact be the case; otherwise the parser
couldn't differentiate between a "normal" ampersand and the beginning of
a character entity.

Which brings me back to my original problem of receiving a truncated XML
message via an HTTP POST (see my previous question). This ONLY HAPPENS
when there is an ampersand somewhere in the message.

Could it be that CGI.params behaves differently when there is an
ampersand in the request, i.e. it tries to parse the request into
key/value pairs and returns a hash rather than a simple string in that
case!?
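
A rough illustration of that suspicion: form-encoded parsing treats & as the separator between key/value pairs, so an XML body containing & is cut in two. This sketch uses `URI.decode_www_form` from the standard library as a stand-in for that kind of parsing (it is not the CGI library's own code path), and the sample body is made up:

```ruby
require 'uri'

# Hypothetical POST body: XML sent as-is, with an unescaped ampersand.
body = '<hello>hello& world</hello>'

# Form-style parsing splits the body on '&' into [key, value] pairs.
pairs = URI.decode_www_form(body)

# Each fragment of the XML ends up as a separate "parameter name".
pairs.each { |key, value| puts "key=#{key.inspect} value=#{value.inspect}" }
```

The message is no longer one string but two unrelated keys, which would explain why only messages containing an ampersand appear truncated.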

I think I might be on to something there..
--
Posted via http://www.ruby-....

Frank Reiff

12/7/2007 8:22:00 PM

> I think I might be on to something there..

OK, it was in fact precisely that. When I call

cgi.params.to_s

I get the correctly formatted XML message, but when there is an & in the
message, the same call produces erratic output.

This is of course because:

The method params() returns a hash of all parameters in the request as
name/value-list pairs, where the value-list is an Array of one or more
values. The CGI object itself also behaves as a hash of parameter names
to values, but only returns a single value (as a String) for each
parameter name.

The output is therefore a fluke that's solely based on the fact that
there is only one parameter.

Now my FINAL question to all the Ruby gurus:

* How do I get the POST-ed message body without any clever splitting
into key/value pairs!?
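
One possible approach, offered as a sketch rather than a definitive answer: under the CGI protocol the request body arrives on standard input and its size is given by the CONTENT_LENGTH environment variable, so a script can read the bytes itself before anything else consumes standard input. The helper name `read_raw_body` is hypothetical, and the request is simulated here with a StringIO and a plain hash:

```ruby
require 'stringio'

# Hypothetical helper: returns the POST body exactly as it was sent,
# without any splitting into key/value pairs.
def read_raw_body(input = $stdin, env = ENV)
  length = env['CONTENT_LENGTH'].to_i
  return '' if length <= 0

  input.binmode if input.respond_to?(:binmode)
  input.read(length) || ''
end

# Simulated request; in a real CGI script the defaults ($stdin, ENV) apply.
fake_body = '<hello>hello&amp; world</hello>'
fake_env  = { 'CONTENT_LENGTH' => fake_body.bytesize.to_s }

raw = read_raw_body(StringIO.new(fake_body), fake_env)
puts raw
```

The body can only be read once, so in a real script this would have to run before any other code reads standard input.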



--
Posted via http://www.ruby-....