comp.lang.ruby

turning a non-ASCII character into a XML entity with REXML?

Francis Hwang

10/16/2004 12:39:00 AM

I asked this a little while back but maybe didn't ask the right way, so
maybe somebody can help me if I rephrase:

I'm trying to build an RSS feed that takes, in its item descriptions,
ISO-8859-1 text. (I'm using REXML for now.) I'd like to be able to take
a non-ASCII character and turn it into a usable XML entity. So, for
example, "\251" would get turned into "&#169;":

str = "\251 2004 Francis Hwang"
elt = REXML::Element.new( 'elt' )
elt.text = str
elt.to_s
=> "<elt>\251 2004 Francis Hwang</elt>"
# But I want "<elt>&#169; 2004 Francis Hwang</elt>"

Is there some sort of setting I can twiddle in REXML so that I can
assign a text that includes these sorts of characters, and REXML will
know to turn them into entities on output? I know I can do this by hand
and then prevent escaping by using the :raw flag, but I'd like to avoid
that if possible.

Francis
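
For reference, the by-hand route mentioned above can be sketched as follows. This is a minimal sketch, not a REXML setting that does the conversion automatically: it assumes the text has already been escaped (the variable name `escaped` is made up for illustration) and relies on the fourth positional argument of REXML::Text.new, the raw flag, to stop REXML from escaping the ampersand a second time.

```ruby
require "rexml/document"

# Assumption: the non-ASCII character was already converted by hand,
# so the string contains only ASCII plus a numeric character reference.
escaped = "&#169; 2004 Francis Hwang"

elt = REXML::Element.new("elt")

# Arguments: text, respect_whitespace, parent, raw.
# raw = true tells REXML the text is already escaped.
text = REXML::Text.new(escaped, false, nil, true)
elt.add(text)

puts elt.to_s
```

The drawback is the one raised in the question: the caller is responsible for producing correctly escaped text before handing it to REXML.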



3 Answers

Patrick May

10/16/2004 6:16:00 AM

I think there's an escapeHTML method in the CGI library that might do it.
Of course, it will also escape > and <. You could still lift the code
from there.

~ pat



Francis Hwang

10/16/2004 3:05:00 PM

I just tried; it doesn't do it.

irb(main):004:0> CGI.escapeHTML( "<br>")
=> "&lt;br&gt;"
irb(main):005:0> CGI.escapeHTML( "<br>\251")
=> "&lt;br&gt;\251"





Brian Candler

10/19/2004 3:45:00 PM

| I'm trying to build an RSS feed that takes, in its item descriptions,
| ISO-8859-1 text. (I'm using REXML for now.) I'd like to be able to take
| a non-ASCII character and turn it into a usable XML entity. So, for
| example, "\251" would get turned into "&#169;"

Not exactly what you're asking for, but you could use Iconv to convert
ISO-8859-1 into UTF-8. It should be perfectly legal to include UTF-8
characters directly in XML, without turning them into character entities.
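
As a hedged illustration of the conversion idea: Iconv was the tool available in 2004, but it was later removed from Ruby's standard library (in 2.0). The sketch below assumes Ruby 1.9 or later and uses String#encode instead. The sample string is taken from the question, and the variable names are illustrative only.

```ruby
# Assumption: Ruby 1.9+, where String#encode replaces Iconv.
# Tag the raw bytes as ISO-8859-1 so Ruby knows how to interpret them.
latin1 = "\xA9 2004 Francis Hwang".dup.force_encoding(Encoding::ISO_8859_1)

# Transcode to UTF-8; the copyright sign becomes the two bytes 0xC2 0xA9.
utf8 = latin1.encode(Encoding::UTF_8)

p utf8.bytes.first(2)
puts utf8.valid_encoding?
```

The resulting string can go into the feed as-is, provided the document is actually served and declared as UTF-8.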

Alternatively, if it's sufficient to convert characters 160-255 straight
into numeric entity refs (which works if the top half of ISO-8859-1 maps
directly into Unicode, as I think it does), then how about

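a = "Copyright \251 2004"
# Replace each byte in the range 160-255 with a numeric character reference.
# (In Ruby 1.8, c[0] is the integer value of the matched byte.)
a.gsub!(/[\240-\377]/) { |c| "&#%d;" % c[0] }
# => "Copyright &#169; 2004"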
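
On Ruby 1.9 and later, indexing a string returns a one-character string rather than an integer, so the snippet above needs adjusting there. A minimal sketch of the same idea that works on the raw bytes follows. The helper name `latin1_to_entities` is hypothetical, and the approach relies on ISO-8859-1 byte values being identical to the Unicode code points U+0000 to U+00FF.

```ruby
# Hypothetical helper (not from the thread): turn ISO-8859-1 bytes 160-255
# into numeric character references and pass every other byte through.
def latin1_to_entities(str)
  str.each_byte.map { |b| b >= 0xA0 ? format("&#%d;", b) : b.chr }.join
end

a = "Copyright \xA9 2004"
puts latin1_to_entities(a)
# prints: Copyright &#169; 2004
```

Because it only looks at bytes, this should be applied to ISO-8859-1 input only. Running it on UTF-8 text would turn each byte of a multi-byte character into its own, wrong, entity.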

Regards,

Brian.