
comp.lang.ruby

[SOLUTION] Quoted Printable (#23)

Patrick Hurley

3/15/2005 6:29:00 AM

I am a Ruby newbie, so be kind. I wrote the code myself, but blatantly
stole Dave Burt's test cases - thank you. I also found one test case
that breaks my code (and Dave's). I am not sure what the correct
answer is, but I know mine is wrong:

Consider:
"===
\n"
which will cause a new space to end up at the end of an encoded line
once the line is wrapped. Is it the case that all space at the end of
a line should be encoded (increasing size rather needlessly, but
simplifying this case)? Either way, I am too tired and have other
important stuff to do, so I will let it go.

Please feel free to let me know where I did not do things the "Ruby
way" as I am primarily a C++ and Perl guy, but very interested in
getting better at Ruby.

Thanks
pth


#
# == Synopsis
#
# Ruby Quiz #23
#
# The quoted printable encoding is used primarily in email, though it has
# recently seen some use in XML areas as well. The encoding is simple to
# translate to and from.
#
# This week's quiz is to build a filter that handles quoted printable
# translation.
#
# Your script should be a standard Unix filter, reading from files listed on
# the command-line or STDIN and writing to STDOUT. In normal operation, the
# script should encode all text read in the quoted printable format. However,
# your script should also support a -d command-line option and when present,
# text should be decoded from quoted printable instead. Finally, your script
# should understand a -x command-line option and when given, it should encode
# <, > and & for use with XML.
#
# == Usage
#
# ruby quiz23.rb [-d | --decode ] [ -x | --xml ]
#
# == Author
# Patrick Hurley, Cornell-Mayo Assoc
#
# == Copyright
# Copyright (c) 2005 Cornell-Mayo Assoc
# Licensed under the same terms as Ruby.
#

require 'optparse'
require 'rdoc/usage'

module QuotedPrintable
  MAX_LINE_PRINTABLE_ENCODE_LENGTH = 76

  def from_qp
    result = self.gsub(/=\r\n/, "")
    result.gsub!(/\r\n/m, $/)
    result.gsub!(/=([\dA-F]{2})/) { $1.hex.chr }
    result
  end

  def to_qp(handle_xml = false)
    char_mask = if (handle_xml)
      /[^!-%,-;=?-~\s]/
    else
      /[^!-<>-~\s]/
    end

    # encode the non-space characters
    result = self.gsub(char_mask) { |ch| "=%02X" % ch[0] }
    # encode the last space character at end of line
    result.gsub!(/(\s)(?=#{$/})/o) { |ch| "=%02X" % ch[0] }

    lines = result.scan(/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/);
    lines.join("=\n").gsub(/#{$/}/m, "\r\n")
  end

  def QuotedPrintable.encode
    STDOUT.binmode
    while (line = gets) do
      print line.to_qp
    end
  end

  def QuotedPrintable.decode
    STDIN.binmode
    while (line = gets) do
      # I am a ruby newbie, and I could
      # not get gets to get the \r\n pairs
      # no matter how I set $/ - any pointers?
      line = line.chomp + "\r\n"
      print line.from_qp
    end
  end

end

class String
  include QuotedPrintable
end

if __FILE__ == $0

  opts = OptionParser.new
  opts.on("-h", "--help") { RDoc::usage; }
  opts.on("-d", "--decode") { $decode = true }
  opts.on("-x", "--xml") { $handle_xml = true }

  opts.parse!(ARGV) rescue RDoc::usage('usage')

  if ($decode)
    QuotedPrintable.decode()
  else
    QuotedPrintable.encode()
  end
end


8 Answers

Dave Burt

3/15/2005 7:50:00 AM


"Patrick Hurley" <phurley@gmail.com> submitted:
>I am a ruby newbie, so be kind. I wrote the code myself, but blatantly
> stole Dave Burt's test cases - thank you. I also found one test case

Quiz tests are for sharing - I think that's established. In any case, you're
welcome to them.

> that breaks my code (and Dave's) that I am not sure what the correct
> answer is, but I know mine is wrong:
>
> Consider:
> "===
> \n"
> which will cause a new space to be found at the end of a string - is
> it the case that all space at the end of the line is encoded
> (increasing size rather needlessly), but simplifying this case? Either
> way, I am too tired and have other important stuff to do so I will let
> it go.

I see no problem. I've added that test case, and both our solutions
pass.

http://www.dave.burt.id.au/ruby/test-quoted-pr...

> Please feel free to let me know where I did not do things the "Ruby
> way" as I am primarily a C++ and Perl guy, but very interested in
> getting better at Ruby.
> ...
> /[^!-<>-~\s]/

Bug: "\f" doesn't get escaped (it's part of /\s/). Probably "\r" as well;
that's harder to test on windows.
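
A quick way to see this bug, as a minimal sketch: the mask is a negated class that lists \s among the characters to leave alone, so every whitespace character slips through unescaped. The constant OLD_MASK and the helper escape_with are names made up for this illustration, and ord is used instead of ch[0] so that it runs on current Ruby.

```ruby
# The original non-XML mask: anything NOT listed in the class gets escaped.
OLD_MASK = /[^!-<>-~\s]/

# Hypothetical helper: escape every character matched by the mask as =XX.
def escape_with(mask, text)
  text.gsub(mask) { |ch| format("=%02X", ch.ord) }
end

puts escape_with(OLD_MASK, "a=b").inspect   # "a=3Db"
puts escape_with(OLD_MASK, "a\fb").inspect  # "a\fb" (form feed left raw)
puts escape_with(OLD_MASK, "a\rb").inspect  # "a\rb" (bare CR left raw)
```

Listing the characters to escape explicitly, rather than the ones to keep, avoids pulling in everything that \s happens to cover.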

I see no other problems. Your optparse is better (i.e. shorter) than mine
:). Your
(/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/)
makes you look like a Perl 5 junkie, though. Also, you use global
variables - we rubyists shun these: use locals.

Cheers,
Dave


James Gray

3/15/2005 3:16:00 PM


(from Patrick's solution--for those who missed it)

while (line = gets) do
# I am a ruby newbie, and I could
# not get gets to get the \r\n pairs
# no matter how I set $/ - any pointers?
...

James Edward Gray II



Patrick Hurley

3/15/2005 4:43:00 PM


Thanks for the kind response.

When I said the test case failed, I meant that the actual output from
encoding the line has trailing space at the end of a line. We both
escape trailing spaces before we break lines - if the line breaking
then moves a space to the end of a line, is that not an issue? (the
continuation = might mean that it is not).

Yup, there was an issue with the masks. I fixed that and removed the
globals (my Perl habit of just throwing in a $ when in doubt :-).
There was also a bug in the command line driver, which I have fixed.
The patched code follows.

> (/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/)
> makes you look like a Perl 5 junkie,

I did this to allow the use of a gsub, which is much faster than the
looping solution. The lookaheads and general ugliness handle the
special cases. I probably should use /x and space it out and comment,
but when I am in the regexp zone, I know what I am typing <grin>.

#
# == Synopsis
#
# Ruby Quiz #23
#
# The quoted printable encoding is used primarily in email, though it has
# recently seen some use in XML areas as well. The encoding is simple to
# translate to and from.
#
# This week's quiz is to build a filter that handles quoted printable
# translation.
#
# Your script should be a standard Unix filter, reading from files listed on
# the command-line or STDIN and writing to STDOUT. In normal operation, the
# script should encode all text read in the quoted printable format. However,
# your script should also support a -d command-line option and when present,
# text should be decoded from quoted printable instead. Finally, your script
# should understand a -x command-line option and when given, it should encode
# <, > and & for use with XML.
#
# == Usage
#
# ruby quiz23.rb [-d | --decode ] [ -x | --xml ]
#
# == Author
# Patrick Hurley, Cornell-Mayo Assoc
#
# == Copyright
# Copyright (c) 2005 Cornell-Mayo Assoc
# Licensed under the same terms as Ruby.
#

require 'optparse'
require 'rdoc/usage'

module QuotedPrintable
  MAX_LINE_PRINTABLE_ENCODE_LENGTH = 76

  def from_qp
    result = self.gsub(/=\r\n/, "")
    result.gsub!(/\r\n/m, $/)
    result.gsub!(/=([\dA-F]{2})/) { $1.hex.chr }
    result
  end

  def to_qp(handle_xml = false)
    char_mask = if (handle_xml)
      /[\x00-\x08\x0b-\x1f\x7f-\xff=<>&]/
    else
      /[\x00-\x08\x0b-\x1f\x7f-\xff=]/
    end

    # encode the non-space characters
    result = self.gsub(char_mask) { |ch| "=%02X" % ch[0] }
    # encode the last space character at end of line
    result.gsub!(/(\s)(?=#{$/})/o) { |ch| "=%02X" % ch[0] }

    lines = result.scan(/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/);
    lines.join("=\n").gsub(/#{$/}/m, "\r\n")
  end

  def QuotedPrintable.encode(handle_xml=false)
    STDOUT.binmode
    while (line = gets) do
      print line.to_qp(handle_xml)
    end
  end

  def QuotedPrintable.decode
    STDIN.binmode
    while (line = gets) do
      # I am a ruby newbie, and I could
      # not get gets to get the \r\n pairs
      # no matter how I set $/ - any pointers?
      line = line.chomp + "\r\n"
      print line.from_qp
    end
  end

end

class String
  include QuotedPrintable
end

if __FILE__ == $0

  decode = false
  # XML escaping is off unless -x is given
  handle_xml = false
  opts = OptionParser.new
  opts.on("-h", "--help") { RDoc::usage; }
  opts.on("-d", "--decode") { decode = true }
  opts.on("-x", "--xml") { handle_xml = true }

  opts.parse!(ARGV) rescue RDoc::usage('usage')

  if (decode)
    QuotedPrintable.decode()
  else
    QuotedPrintable.encode(handle_xml)
  end
end


Dave Burt

3/15/2005 8:40:00 PM


"Patrick Hurley" <phurley@gmail.com> continued:
> Thanks for the kind response.
>
> When I said the test case failed, I meant that the actual output from
> encoding the line has trailing space at the end of a line. We both
> escape trailing spaces before we break lines - if the line breaking
> then moves a space to the end of a line, is that not an issue? (the
> continuation = might mean that it is not).

From the RFC (2045, section 6.7):
Any TAB (HT) or SPACE characters
on an encoded line MUST thus be followed on that line
by a printable character. In particular, an "=" at the
end of an encoded line, indicating a soft line break
(see rule #5) may follow one or more TAB (HT) or SPACE
characters.

So it's all good - unescaped tabs and spaces are fine as long as it's got a
printable non-whitespace character after it, and "=" is fine for that.

... Therefore, when decoding a Quoted-Printable
body, any trailing white space on a line must be
deleted, as it will necessarily have been added by
intermediate transport agents.

There's something I think we've all forgotten to do -- strip trailing unescaped
whitespace. I've added the following test:

def test_decode_strip_trailing_space
  assert_equal(
    "The following whitespace must be ignored: \r\n".from_quoted_printable,
    "The following whitespace must be ignored:\n")
end

And the following line to decode_string:
result.gsub!(/[\t ]+(?=\r\n|$)/, '')
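
To show where that line sits in a whole decoder, here is a minimal sketch. It is not the actual decode_string from my solution; the name qp_decode and the ordering of the steps are assumptions made for illustration. The input is copied as raw bytes so that decoded =XX values above 0x7F do not clash with the string's encoding.

```ruby
# Sketch of a quoted-printable decoder (qp_decode is a hypothetical name).
def qp_decode(text)
  text.b                                   # binary copy: treat input as raw bytes
      .gsub(/[\t ]+(?=\r?\n|\z)/, "")      # drop unescaped trailing tabs and spaces
      .gsub(/=\r?\n/, "")                  # remove soft line breaks
      .gsub(/\r\n/, "\n")                  # normalise hard line breaks
      .gsub(/=([\dA-F]{2})/) { $1.hex.chr } # decode =XX escapes
end

puts qp_decode("The following whitespace must be ignored: \r\n").inspect
puts qp_decode("kept:=20\r\n").inspect     # the escaped space survives
```

Stripping the whitespace first means a soft break that picked up padding in transit ("=" followed by spaces and CRLF) is still recognised as a soft break.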

>
> Yup, there was an issue with the masks. I fixed that and removed the
> globals (my Perl habit of just throwing in a $ when in doubt :-).
> There was also a bug in the command line driver, which I have fixed.
> The patched code follows.
>
>> (/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/)
>> makes you look like a Perl 5 junkie,
>
> I did this to allow the use of a gsub, which is much faster than the
> looping solution. The lookaheads and general ugliness handle the
> special cases. I probably should use /x and space it out and comment,
> but when I am in the regexp zone, I know what I am typing <grin>.

Write-only? No, I'm not in a fantastic position to comment, mine is not that
much shorter.

> ...
> def QuotedPrintable.decode
> STDIN.binmode
> while (line = gets) do
> # I am a ruby newbie, and I could
> # not get gets to get the \r\n pairs
> # no matter how I set $/ - any pointers?

| C:\WINDOWS>ruby
| STDIN.binmode
| gets.each_byte do |b| puts b end
| ^Z
|
| 13
| 10
|
Seems to work for me - that output says I wouldn't need the following line

> line = line.chomp + "\r\n"
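
If you would rather not depend on $/ at all, gets also accepts the separator as an argument. A small sketch, with StringIO standing in for a binmode STDIN so that it can be run anywhere (the sample input is made up):

```ruby
require "stringio"

# StringIO plays the role of a binmode STDIN in this sketch.
input = StringIO.new("first=\r\nsecond\r\n")

# Passing the separator explicitly makes gets return whole CRLF-terminated lines.
lines = []
while (line = input.gets("\r\n"))
  lines << line
end

p lines  # ["first=\r\n", "second\r\n"]
```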

Cheers,
Dave



Patrick Hurley

3/15/2005 10:15:00 PM


Thanks for the update on the RFC, guess I should have just read that myself.

Well, I don't want to "litter" the newsgroup, but I hate to have
incorrect code out there with my name on it. So, if you want, follow
the link (http://hurleyhome.com/~patrick...) to see the fixed code.
Also of note is the now commented (just for Dave) regexp for parsing
long lines, for the curious:

lines = result.scan(/
  # Match one of the three following cases
  (?:
    # This will match the special case of an escape that would generally have
    # split across line boundaries
    (?: [^\n]{74}(?==[\dA-F]{2}) ) |
    # This will match the case of a line of text that does not need to split
    (?: [^\n]{0,76}(?=\n) ) |
    # This will match the case of a line of text that needs to be
    # split without special adjustment
    (?: [^\n]{1,75}(?!\n{2}) )
  )
  # Match zero or more newlines
  (?-x:#{$/}*)/x);
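
For comparison, the looping approach that the regexp replaces might look roughly like the sketch below. This is not code from my solution: soft_wrap and MAX_LEN are made-up names, and it assumes the line has already been escaped, so every "=" in it starts a three-character =XX escape.

```ruby
MAX_LEN = 76  # longest encoded line RFC 2045 allows, counting a trailing "="

# Soft-wrap one already-escaped line (no newline inside it).
def soft_wrap(line)
  pieces = []
  rest = line.dup
  while rest.length > MAX_LEN
    cut = MAX_LEN - 1            # leave one column for the "=" soft break
    if rest[cut - 1] == "="      # cut would fall between "=" and "XX"
      cut -= 1
    elsif rest[cut - 2] == "="   # cut would fall between "=X" and "X"
      cut -= 2
    end
    pieces << rest.slice!(0, cut)
  end
  pieces << rest
  pieces.join("=\r\n")
end

# An escape straddling column 75 is pushed whole onto the next line.
puts soft_wrap("a" * 74 + "=3D" + "b" * 10)
```

The first branch corresponds to the {74} case in the regexp; the others are the plain split.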

pth




Dave Burt

3/16/2005 11:35:00 AM


"Florian Gross" <flgr@ccan.de> wrote:
> Matthew Moss wrote:
>
>> Here is my partial solution for the Quoted Printable quiz. I'm still
>> pretty new to Ruby, so it took me a while to get what you see here. I
>> think the only thing I didn't get to adding was line length checks.
>
> And here's mine as well. Sorry for being late -- I coded this up on
> Friday and forgot about it until today.
>
> It ought to handle everything correctly (including proper wrapping of
> lines that end in encoded characters) and it does most of the work with
> a few simple regular expressions.
>

Hi Florian,

As always, I'm amazed by your concise code. But your solution seems to be
failing a bunch of my tests (and not just by chopping lines early, which is
allowed):

encoding:
- escapes mid-line whitespace
- escapes '~'
- allows too-long lines (my tests saw up to 104 characters on a line)
- allows unescaped whitespace on the end of a line (as long as it's preceded
by escaped whitespace)
decoding:
- doesn't ignore trailing literal whitespace

Cheers,
Dave


Florian Gross

3/16/2005 5:33:00 PM


Dave Burt wrote:

> Hi Florian,

Moin Dave.

> As always, I'm amazed by your concise code. But your solution seems to be
> failing a bunch of my tests (and not just by chopping lines early, which is
> allowed):

Thanks, I'll have a look.

> encoding:
> - escapes mid-line whitespace

I'm not sure I get this. Am I incorrectly escaping mid-line whitespace
or am I incorrectly not escaping it? And what is mid-line whitespace?

> - escapes '~'

Heh, classic off-by-one. Easily fixed by changing the Regexp. See source
below.

> - allows too-long lines (my tests saw up to 104 characters on a line)

Any hints on when this is happening? I can't see why and when this would
happen.

> - allows unescaped whitespace on the end of a line (as long as it's preceded
> by escaped whitespace)

Fixed. See code below.

> decoding:
> - doesn't ignore trailing literal whitespace

Well, I don't think that's much of an issue as I'm not sure when
trailing whitespace would be appended to lines, but I've fixed it anyway.

Here's the new code:

def encode(text, also_encode = "")
  text.gsub(/[\t ](?:[\v\t ]|$)|[=\x00-\x08\x0B-\x1F\x7F-\xFF#{also_encode}]/) do |char|
    char[0 ... -1] + "=%02X" % char[-1]
  end.gsub(/^(.{75})(.{2,})$/) do |match|
    base, continuation = $1, $2
    continuation = base.slice!(/=(.{0,2})\Z/).to_s + continuation
    base + "=\n" + continuation
  end.gsub("\n", "\r\n")
end

def decode(text, allow_lowercase = false)
  encoded_re = Regexp.new("=([0-9A-F]{2})", allow_lowercase ? "i" : "")
  text.gsub("\r\n", "\n").gsub("=\n", "").gsub(encoded_re) do
    $1.to_i(16).chr
  end
end

I'll repost the full source when I've sorted out that other problem as well.

Dave Burt

3/17/2005 12:15:00 AM


"Florian Gross" <flgr@ccan.de> responded
> Dave Burt wrote:
>
>> Hi Florian,
>
> Moin Dave.
>
>> As always, I'm amazed by your concise code. But your solution seems to be
>> failing a bunch of my tests (and not just by chopping lines early, which
>> is allowed):
>
> Thanks, I'll have a look.
>
>> encoding:
>> - escapes mid-line whitespace
>
> I'm not sure I get this. Am I incorrectly escaping mid-line whitespace or
> am I incorrectly not escaping it? And what is mid-line whitespace?

Tabs and spaces that are followed by something printable on the same line
should not be escaped; see the following:

5) Failure:
test_encode_12(TC_QuotedPrintable) [(eval):2]:
<"=3D=3D=3D
=\r\n =20\r\n"> expected but was
<"=3D=3D=3D=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
=\r\n=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
=20 =20 =20 =20 =20 =20 =20 =20\r\n">.

>> - escapes '~'
>
> Heh, classic off-by-one. Easily fixed by changing the Regexp. See source
> below.

Too easy :)

>> - allows too-long lines (my tests saw up to 104 characters on a line)
>
> Any hints on when this is happening? I can't see why and when this would
> happen.

test_encode_12 also demonstrates this. I fixed it by changing
/[\t ](?:[\v\t ]|$)../ to /[\t ]$../.
This (obviously) fixes the mid-line whitespace as well.
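
A small sketch of the difference, using only the whitespace branch of the pattern. The names OLD_WS, NEW_WS and escape_ws are made up for this illustration, and ord replaces the Ruby 1.8 style char[-1] so that it runs on current Ruby:

```ruby
OLD_WS = /[\t ](?:[\v\t ]|$)/  # whitespace branch as posted
NEW_WS = /[\t ]$/              # only a tab or space at the end of a line

# Escape the last character of each match, keeping anything before it.
def escape_ws(text, pattern)
  text.gsub(pattern) { |m| m[0...-1] + format("=%02X", m[-1].ord) }
end

sample = "a  b \n"
p escape_ws(sample, OLD_WS)  # "a =20b=20\n" (mid-line space escaped too)
p escape_ws(sample, NEW_WS)  # "a  b=20\n"
```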

>> - allows unescaped whitespace on the end of a line (as long as it's
>> preceded by escaped whitespace)
>
> Fixed. See code below.
>
>> decoding:
>> - doesn't ignore trailing literal whitespace
>
> Well, I don't think that's much of an issue as I'm not sure when trailing
> whitespace would be appended to lines, but I've fixed it anyway.

It's not mentioned explicitly in the quiz question, although you can
infer from it that trailing whitespace is illegal. The idea is that if
there is trailing whitespace, it has been added in transit and should
be removed (it's not actually part of the data that was encoded).

Also, this, on line 10: "char[0 ... -1] + ...", seems redundant - with char
as a one-character match, it's an empty string.

> Here's the new code:
>
>> <snip>
>
> I'll repost the full source when I've sorted out that other problem as
> well.

Cheers,
Dave