comp.lang.ruby

Ruby method to strip out XML codes?

Michael W. Ryder

12/6/2007 1:14:00 AM

I am trying to process an XML file that includes various codes. The
problem I am running into is that some of these codes are inserted into
the middle of an encrypted string. If I display the file in a
browser, these codes do not show up, and copying and pasting the string
works fine. The problem occurs when I try to extract the string in a
program and these "extraneous" XML codes are included. This, of course,
makes the decryption routine crash.
What I am looking for is a simple way to read through the file and
remove all the XML codes, leaving just plain text. I could probably
write a series of regular expressions to remove each code that I can
find in my text, but I am afraid I might miss some and it will come back to
haunt me later.
2 Answers

Phrogz

12/6/2007 3:19:00 AM



str.gsub(/<\/?[^>]+>/, '')

This will only be a problem if your XML file has a CDATA section
containing a literal < character (not &lt;), which is legal XML, like:

for ( var i=0, len=a.length; i<len; ++i )

In that case you likely want to use a proper XML parser (like REXML)
instead.
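
A minimal sketch of the parser route, assuming REXML is installed (it is a bundled gem in recent Ruby versions). The sample document and its element names are made up for illustration, and plain_text is a hypothetical helper, not part of REXML:

```ruby
require 'rexml/document'

# Hypothetical sample document; the element names are invented.
xml = <<~XML
  <record>
    <name>Example &amp; Co.</name>
    <script><![CDATA[for ( var i=0, len=a.length; i<len; ++i )]]></script>
  </record>
XML

doc = REXML::Document.new(xml)

# Walk the tree and keep only character data. Text#value returns the
# text with entities decoded; CData is a subclass of Text, so CDATA
# content is kept as-is. Comments and processing instructions are dropped.
def plain_text(element)
  element.children.map do |node|
    case node
    when REXML::Element then plain_text(node)
    when REXML::Text    then node.value
    else ''
    end
  end.join
end

text = plain_text(doc.root)
puts text
```

Unlike the regular expression, this leaves the literal < inside the CDATA section alone, because the parser knows it is not the start of a tag.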

Do you really want to remove the XML, or would it suffice to just:

str.gsub! '&', '&amp;'
str.gsub! '<', '&lt;'
str.gsub! '>', '&gt;'
(and maybe even)
str.gsub! '"', '&quot;'
str.gsub! "'", '&apos;'

to make your string valid and escaped for use in an HTML context?
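
If escaping is all that is needed, the standard library's CGI.escapeHTML does the same substitutions in one call. A small sketch with a made-up input string (note that it writes the apostrophe as a numeric reference rather than &apos;):

```ruby
require 'cgi'

# Made-up input that contains all five special characters.
str = %q{if a < b && c > "d" then 'e'}

# Escapes &, <, >, " and ' in a single pass, so the & produced by one
# replacement is never escaped a second time.
escaped = CGI.escapeHTML(str)
puts escaped
```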

Michael W. Ryder

12/6/2007 7:53:00 AM



My problem is that the XML file includes &#xD;&#xA; in the middle of a
couple of fields, especially the encrypted fields. If I just extract
the encrypted field and try to decrypt it, the program crashes because the
key is invalid. I have to remove the "bad" character strings before
sending it to my decryption program. I would prefer to do this removal
before sending the file to my programs so that I don't have to deal with
these codes.
I assume that the string I am seeing is XML's way of saying CR/LF, as 0D 0A
in hex is CR/LF and the output in a browser shows the field being broken
at that point. The problem is that those are only the ones that I have
noticed, and there may be others hiding in the data. The XML file is being
parsed for conversion to our accounts.
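
One way to avoid hunting for each code individually is to let an XML parser decode every character reference and then drop whitespace from the encrypted field. This is only a sketch under assumptions: the element names and payload are invented, and it presumes the encrypted value is text (hex or Base64, say) in which whitespace carries no meaning:

```ruby
require 'rexml/document'

# Hypothetical record; the element names and the payload are made up.
xml = '<account><secret>QUJD&#xD;&#xA;REVG&#xD;&#xA;R0hJ</secret></account>'

doc = REXML::Document.new(xml)

# Element#text returns the element's text with character references
# such as &#xD; and &#xA; already decoded into real characters.
raw = doc.root.elements['secret'].text

# Assumption: whitespace is never part of the encrypted value, so
# removing all of it (CR, LF, tabs, spaces) is safe.
cleaned = raw.gsub(/\s+/, '')
puts cleaned
```

Because the parser handles the decoding, any other numeric reference that decodes to whitespace is removed by the same gsub without needing a regular expression of its own.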