comp.lang.ruby
REXML: parsing a string with unescaped ampersand entities
Frank Reiff
12/7/2007 6:12:00 PM
Hi,
REXML seems to SOMETIMES choke on parsing ampersands within entities,
e.g.
require 'rexml/document'
include REXML

string = '<?xml version="1.0" encoding="UTF-8"?><hello>hello&world</hello>'
doc = Document.new(string)
puts "#{doc}"
works fine (output below):
<?xml version='1.0' encoding='UTF-8'?><hello>hello&world</hello>
BUT:
string = '<?xml version="1.0" encoding="UTF-8"?><hello>hello& world</hello>'
doc = Document.new(string)
puts "#{doc}"
crashes out with:
REXML::ParseException: #<RuntimeError: Illegal character '&' in raw string "hello& world">
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/text.rb:91:in `initialize'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:43:in `new'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:43:in `parse'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:190:in `build'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:45:in `initialize'
/Users/frankreiff/Live Developments/ruby/analyze/xml_parser.rb:102:in `new'
/Users/frankreiff/Live Developments/ruby/analyze/xml_parser.rb:102
...
Illegal character '&' in raw string "hello& world"
Line:
Position:
Last 80 unconsumed characters:
</hello>
The only difference is the space after the &.
What is going on, and how can I fix this?
Best regards,
Frank
--
Posted via http://www.ruby-...
3 Answers
Bob Hutchison
12/7/2007 7:18:00 PM
Hi,
On 7-Dec-07, at 1:12 PM, Frank Reiff wrote:
> Hi,
>
> REXML seems to SOMETIMES choke on parsing ampersands within entities,
> e.g.
>
> string = '<?xml version="1.0" encoding="UTF-8"?><hello>hello&world</hello>'
> doc = Document.new(string)
> puts "#{doc}"
>
> works fine (output below):
> <?xml version='1.0' encoding='UTF-8'?><hello>hello&world</hello>
>
> BUT:
>
> string = '<?xml version="1.0" encoding="UTF-8"?><hello>hello& world</hello>'
> doc = Document.new(string)
> puts "#{doc}"
>
[snip]
>
> What is going on? and how can I fix this?
Neither is legal XML; both should fail. A literal ampersand is not
allowed in element content. You can either escape the content (write
the ampersand as &amp;) or use a CDATA block.
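Both options can be sketched with REXML itself. This is a minimal illustration, assuming the rexml library is installed (it ships as a bundled gem with recent Ruby versions). The element name hello comes from the example in the question; the variable names are only for illustration.

```ruby
require 'rexml/document'

raw = 'hello& world'

# Option 1: let REXML do the escaping. Assigning to Element#text stores
# the string as character data, and serializing it writes & as &amp;.
escaped = REXML::Element.new('hello')
escaped.text = raw
escaped_xml = escaped.to_s

# Option 2: wrap the text in a CDATA section, which is written verbatim.
wrapped = REXML::Element.new('hello')
wrapped.add(REXML::CData.new(raw))
cdata_xml = wrapped.to_s

# Print each serialized form and the text it parses back to.
[escaped_xml, cdata_xml].each do |xml|
  puts xml
  puts REXML::Document.new(xml).root.text
end
```

Either form should parse cleanly and hand back the original string, ampersand and space included. Escaping is usually the simpler choice for short text; CDATA can be more convenient for larger blocks that contain many special characters.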
Cheers,
Bob
----
Bob Hutchison -- tumblelog at
http://www.recurs...
Recursive Design Inc. -- weblog at
http://www.recursiv...
http://www.rec...
-- works on
http://www.raconteur.info/cms-for-static-con...
Frank Reiff
12/7/2007 7:40:00 PM
> Neither is legal XML, both should fail. You can either escape the
> content or use a CDATA block.
You're of course right. Both are illegal.
Somebody suggested to me that the original problem might be caused by
incorrectly encoded entities (&amp; &quot;), and reading through the W3C
spec (always a bad idea) confused me to the extent of believing that
you only had to encode character entities in attribute values, which
isn't the case. It can't in fact be the case; otherwise the parser
couldn't differentiate between a "normal" ampersand and the beginning
of a character entity.
Which brings me back to my original problem of receiving truncated XML
in an HTTP POST (see my previous question). This ONLY HAPPENS when there
is an ampersand somewhere in the message.
Could it be that CGI.params behaves differently when there is an
ampersand in the request, e.g. it tries to parse the request into
key/value pairs and returns a hash rather than a simple string in that
case!?
I think I might be on to something there..
Frank Reiff
12/7/2007 8:22:00 PM
> I think I might be on to something there..
OK, it was in fact precisely that. When the message contains no
ampersand, calling
cgi.params.to_s
gives me the correctly formatted XML message, but when there is an & in
the message, the same call produces erratic output.
This is of course because:
The method params() returns a hash of all parameters in the request as
name/value-list pairs, where the value-list is an Array of one or more
values. The CGI object itself also behaves as a hash of parameter names
to values, but only returns a single value (as a String) for each
parameter name.
The output is therefore a fluke that's solely based on the fact that
there is only one parameter.
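The effect can be illustrated with a simplified stand-in for form-style parsing. This is not the CGI library's actual code, only a sketch of the general idea (split on "&", then on the first "="); the helper name split_form_pairs is hypothetical.

```ruby
# Simplified stand-in for form-style parameter parsing: split the body on
# "&" into pairs, then split each pair on the first "=" into name and value.
# This is NOT the CGI library's code, only an illustration of the idea.
def split_form_pairs(body)
  params = Hash.new { |hash, key| hash[key] = [] }
  body.split('&').each do |pair|
    next if pair.empty?
    name, value = pair.split('=', 2)
    params[name] << (value || '')
  end
  params
end

body = '<?xml version="1.0" encoding="UTF-8"?><hello>hello& world</hello>'
params = split_form_pairs(body)

# Show how the XML ended up distributed across parameter names and values.
params.each do |name, values|
  puts "#{name.inspect} => #{values.inspect}"
end
```

With an ampersand in the body, the XML is cut into two "parameters", and the "=" inside the XML declaration splits the first one again into a name and a value. Flattening such a hash back into a string cannot be relied on to reproduce the original message.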
Now my FINAL question to all the Ruby gurus:
* How do I get the POST-ed message body without any clever splitting
into key/value pairs!?
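One possible approach, assuming the script runs as a plain CGI program: under the CGI protocol the request body arrives on standard input and its size is given by the CONTENT_LENGTH environment variable, so it can be read directly. This would need to happen before a CGI object is created, since the library may consume standard input when it parses the request. The helper name read_raw_post_body is hypothetical, and the request below is simulated so the sketch runs outside a web server.

```ruby
require 'stringio'

# Reads the request body exactly as sent, without any form parsing.
# env and input default to the process environment and standard input,
# which is where a CGI program receives the request.
def read_raw_post_body(env = ENV, input = $stdin)
  length = env['CONTENT_LENGTH'].to_i
  return '' if length <= 0
  input.read(length) || ''
end

# Simulated request (an assumption for illustration only).
xml = '<hello>hello&amp; world</hello>'
fake_env = { 'REQUEST_METHOD' => 'POST', 'CONTENT_LENGTH' => xml.bytesize.to_s }
body = read_raw_post_body(fake_env, StringIO.new(xml))
puts body
```

In a real CGI script the call would simply be read_raw_post_body with no arguments, and the returned string could then be passed to Document.new.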