
comp.lang.ruby

Enhancing the Gateway (Help Needed)

James Gray

10/28/2007 9:21:00 PM

Here's the short story on the current situation with our mailing list
to Usenet gateway:

* Our Usenet host rejects multipart/alternative messages
because they are technically illegal Usenet posts
* This means that some emails do not reach comp.lang.ruby
(several messages each day according to the logs)
* We don't like this

To solve this, we want to enhance the gateway to convert multipart/
alternative messages into something we can legally post to Usenet. I
have two thoughts on this strategy:

1. If possible, we should gather all text/plain portions of an email
and post those with a content-type of text/plain
2. If that fails, we can just post the original body but force the
content-type to text/plain for maximum compatibility

Now I need all of you email and Usenet experts to tell me if that's a
sane strategy. If another approach would be better, please clue me in.

I've pretty much made it this far. The code at the bottom of this
message is the mail_to_news.rb script used by the gateway rewritten
using this strategy.

If you aren't familiar with the gateway code, you can get details
from the articles at:

http://blog.grayproductions.net/categories/t...

There's one problem left I know I haven't solved correctly. Help me
figure out a decent strategy for this last piece and we can deploy
the new code.

The outstanding issue is how to handle character sets for the
constructed message. You'll see in the code below that I just pull
the charset param from the original message, but after looking at a
few messages, I realize that this doesn't make sense. For example,
here are the relevant portions of a recent post that wasn't gated
correctly:

Content-Type: multipart/alternative; boundary=Apple-Mail-18-445454026

--Apple-Mail-18-445454026
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
	charset=US-ASCII;
	delsp=yes;
	format=flowed

As you can see, the overall email doesn't have a charset but each
text portion can. If we are going to merge these parts, what's the
best strategy for handling the charset?

I thought of trying to convert them all to UTF-8 with Iconv, but I'm
not sure what to do if a type doesn't declare a charset or when Iconv
chokes on what is declared? Please share your opinions.
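
One possible shape for that conversion is sketched below. It assumes Ruby 1.9 or later, where String#encode stands in for Iconv (the gateway script itself targets Ruby 1.8 with $KCODE, so this is not a drop-in patch). The helper name to_utf8, the "?" replacement character, and the US-ASCII fallback for missing or unknown charsets are all illustrative choices, not part of the gateway code.

```ruby
# Hypothetical helper: convert one part's raw body to UTF-8.
# declared_charset is whatever the part's Content-Type said, or nil.
def to_utf8(raw, declared_charset = nil)
  source = begin
    Encoding.find(declared_charset || "US-ASCII")
  rescue ArgumentError
    Encoding::US_ASCII # unknown charset name: fall back to the default
  end
  text = raw.dup.force_encoding(source)
  # Replace bytes that are invalid in the declared charset instead of raising.
  # scrub covers the case where the source is already UTF-8.
  text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
      .scrub("?")
end
```

With this approach a bad byte becomes a visible "?" in the posted article rather than causing the whole message to be dropped.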

If you are feeling really adventurous, rewrite the relevant portion
of the code below, which I will bracket with FIX ME comments.

Here's the script:

#!/usr/bin/env ruby

# written by James Edward Gray II <james@grayproductions.net>

$KCODE = "u"

GATEWAY_DIR = File.join(File.dirname(__FILE__), "..").freeze

$LOAD_PATH << File.join(GATEWAY_DIR, "config") << File.join(GATEWAY_DIR, "lib")

require "tmail"

require "servers_config"
require "nntp"

require "logger"
require "timeout"

# prepare log
log = Logger.new(ARGV.shift || $stdout)
log.datetime_format = "%Y-%m-%d %H:%M "

# build incoming and outgoing message object
incoming = TMail::Mail.parse($stdin.read)
outgoing = TMail::Mail.new

# skip any flagged messages
if incoming["X-Rubymirror"].to_s == "yes"
  log.info "Skipping message ##{incoming.message_id}, sent by news_to_mail"
  exit
elsif incoming["X-Spam-Status"].to_s =~ /\AYes/
  log.info "Ignoring Spam ##{incoming.message_id}: " +
           "#{incoming.subject}–#{incoming.from}"
  exit
end

# only allow certain headers through
%w[from subject in_reply_to transfer_encoding date].each do |header|
  outgoing.send("#{header}=", incoming.send(header))
end
outgoing.message_id = incoming.message_id.sub(/\.+>$/, ">")
%w[X-ML-Name X-Mail-Count X-X-Sender].each do |header|
  outgoing[header] = incoming[header].to_s if incoming.key?(header)
end

# doctor headers for Ruby Talk
outgoing.references = if incoming.key? "References"
  incoming.references
else
  if incoming.key? "In-Reply-To"
    incoming.reply_to
  else
    if incoming.subject =~ /^Re:/
      outgoing.reply_to = "<this_is_a_dummy_message-id@rubygateway>"
    end
  end
end
outgoing["X-Ruby-Talk"] = incoming.message_id
outgoing["X-Received-From"] = <<END_GATEWAY_DETAILS.gsub(/\s+/, " ")
This message has been automatically forwarded from the ruby-talk
mailing list by
a gateway at #{ServersConfig::NEWSGROUP}. If it is SPAM, it did not
originate at
#{ServersConfig::NEWSGROUP}. Please report the original sender, and
not us.
Thanks! For more details about this gateway, please visit:
http://blog.grayproductions.net/categories/t...
END_GATEWAY_DETAILS
outgoing["X-Rubymirror"] = "Yes"

# translate the body of the message, if needed
if incoming.multipart? and incoming.sub_type == "alternative"
  ### FIX ME ###
  # handle multipart/alternative messages
  # extract body
  body = ""
  extract_text = lambda do |message_or_part|
    if message_or_part.multipart?
      message_or_part.each_part { |part| extract_text[part] }
    elsif message_or_part.content_type == "text/plain"
      body += message_or_part.body
    end
  end
  extract_text[incoming]
  if body.empty?
    outgoing.body = "Note: the content-type of this message was altered by " +
                    "the gateway.\n\n#{incoming.body}"
  else
    outgoing.body = "Note: non-text portions of this message were stripped " +
                    "by the gateway.\n\n#{body}"
  end
  # set the content type of the new message
  outgoing.set_content_type( "text", "plain",
                             "charset" => incoming.type_param("charset") )
  ### END FIX ME ###
else
  %w[content_type body].each do |header|
    outgoing.send("#{header}=", incoming.send(header))
  end
end

log.info "Sending message ##{incoming.message_id}: " +
"#{incoming.subject}–#{incoming.from}…"
log.info "Message looks like:\n#{outgoing.encoded}"

# connect to NNTP host
begin
  nntp = nil
  Timeout.timeout(30) do
    nntp = Net::NNTP.new( ServersConfig::NEWS_SERVER,
                          Net::NNTP::NNTP_PORT,
                          ServersConfig::NEWS_USER,
                          ServersConfig::NEWS_PASS )
  end
rescue Timeout::Error
  log.error "The NNTP connection timed out"
  exit -1
rescue
  log.fatal "Unable to establish connection to NNTP host: #{$!.message}"
  exit -1
end

# attempt to send newsgroup post
unless $DEBUG
  begin
    result = nil
    Timeout.timeout(30) { result = nntp.post(outgoing.encoded) }
  rescue Timeout::Error
    log.error "The NNTP post timed out"
    exit -1
  rescue
    log.fatal "Unable to post to NNTP host: #{$!.message}"
    exit -1
  end
  log.info "… Sent. nntp.post() result: #{result}"
end

__END__

Thanks for the help.

James Edward Gray II


24 Answers

Bill Kelly

10/28/2007 11:40:00 PM



From: "James Edward Gray II" <james@grayproductions.net>
>
> 1. If possible, we should gather all text/plain portions of an email
> and post those with a content-type of text/plain

Do we get many HTML-only messages, having a text/html part, without a
corresponding text/plain part?

Or is that too uncommon to worry about?


Regards,

Bill



Nobuyoshi Nakada

10/29/2007 3:01:00 AM


Hi,

At Mon, 29 Oct 2007 06:20:48 +0900,
James Edward Gray II wrote in [ruby-talk:276334]:
> To solve this, we want to enhance the gateway to convert multipart/
> alternative messages into something we can legally post to Usenet. I
> have two thoughts on this strategy:
>
> 1. If possible, we should gather all text/plain portions of an email
> and post those with a content-type of text/plain

Rather I want it to be done by FML itself on ruby-lang.org.

> 2. If that fails, we can just post the original body but force the
> content-type to text/plain for maximum compatibility

I do it locally by `w3m -dump -T text/html`.
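
For the gateway, shelling out to such a dumper might look roughly like the sketch below. The helper name html_to_text and the W3M_COMMAND constant are hypothetical; only the w3m flags come from the line above, and the code assumes w3m is installed on the gateway host.

```ruby
require "open3"

# Command taken from the suggestion above; assumed to be on the PATH.
W3M_COMMAND = ["w3m", "-dump", "-T", "text/html"].freeze

# Hypothetical helper: pipe HTML through an external dumper and return its
# output. Returns nil when the command cannot be started or exits with a
# failure status, so the caller can fall back to another strategy.
def html_to_text(html, command = W3M_COMMAND)
  output, status = Open3.capture2(*command, stdin_data: html)
  status.success? ? output : nil
rescue SystemCallError
  nil
end
```

Passing the command as a parameter keeps the helper easy to try with lynx or links instead.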

> The outstanding issue is how to handle character sets for the
> constructed message. You'll see in the code below that I just pull
> the charset param from the original message, but after looking at a
> few messages, I realize that this doesn't make sense. For example,
> here are the relevant portions of a recent post that wasn't gated
> correctly:
>
> Content-Type: multipart/alternative; boundary=Apple-Mail-18-445454026
>
> --Apple-Mail-18-445454026
> Content-Transfer-Encoding: 7bit
> Content-Type: text/plain;
> charset=US-ASCII;
> delsp=yes;
> format=flowed
>
> As you can see, the overall email doesn't have a charset but each
> text portion can. If we are going to merge these parts, what's the
> best strategy for handling the charset?

"alternative" means each body actually has the same contents,
so, in theory, you can and should select one of them.
Merging them all is the wrong behavior. I suspect you mean
multipart/relative.
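
A rough sketch of "select one of them" follows. Part is a hypothetical stand-in for a parsed MIME part (it is not TMail's API), and preferring text/plain is an assumption that suits a Usenet gateway rather than a rule from the MIME specification.

```ruby
# Hypothetical stand-in for a parsed MIME part (TMail is not used here).
Part = Struct.new(:content_type, :body, :parts)

# From a multipart/alternative container, pick a single representation:
# the first text/plain part if there is one, otherwise the first part.
def pick_alternative(container)
  container.parts.find { |part| part.content_type == "text/plain" } ||
    container.parts.first
end

message = Part.new("multipart/alternative", nil, [
  Part.new("text/plain", "Hello, Ruby Talk.", []),
  Part.new("text/html",  "<p>Hello, Ruby Talk.</p>", [])
])

chosen = pick_alternative(message)
puts chosen.content_type
puts chosen.body
```

Because only one part is kept, its own charset parameter can be carried over directly and no merging of charsets is needed.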

> I thought of trying to convert them all to UTF-8 with Iconv, but I'm
> not sure what to do if a type doesn't declare a charset or when Iconv
> chokes on what is declared? Please share your opinions.

Should be defaulted to US-ASCII.

--
Nobu Nakada

James Gray

10/29/2007 3:19:00 AM


On Oct 28, 2007, at 10:00 PM, Nobuyoshi Nakada wrote:

> Hi,
>
> At Mon, 29 Oct 2007 06:20:48 +0900,
> James Edward Gray II wrote in [ruby-talk:276334]:
>> To solve this, we want to enhance the gateway to convert multipart/
>> alternative messages into something we can legally post to Usenet. I
>> have two thoughts on this strategy:
>>
>> 1. If possible, we should gather all text/plain portions of an email
>> and post those with a content-type of text/plain
>
> Rather I want it to be done by FML itself on ruby-lang.org.

Excellent. Are there any plans to make that happen?

I'm trying to get it in the gateway so we can stop having this
discussion. ;) But if there are plans to have the list itself do
it, that's great.

>> 2. If that fails, we can just post the original body but force the
>> content-type to text/plain for maximum compatibility
>
> I do it locally by `w3m -dump -T text/html`.

Yes, I assume we could use lynx/links to similar effect. My strategy
wasn't as clever, but I thought by swapping the content type we would
at least get the content, though it would have some noise.

>> The outstanding issue is how to handle character sets for the
>> constructed message. You'll see in the code below that I just pull
>> the charset param from the original message, but after looking at a
>> few messages, I realize that this doesn't make sense. For example,
>> here are the relevant portions of a recent post that wasn't gated
>> correctly:
>>
>> Content-Type: multipart/alternative; boundary=Apple-Mail-18-445454026
>>
>> --Apple-Mail-18-445454026
>> Content-Transfer-Encoding: 7bit
>> Content-Type: text/plain;
>> charset=US-ASCII;
>> delsp=yes;
>> format=flowed
>>
>> As you can see, the overall email doesn't have a charset but each
>> text portion can. If we are going to merge these parts, what's the
>> best strategy for handling the charset?
>
> "alternative" means each body actually has the same contents,
> so, in theory, you can and should select one of them.
> Merging them all is the wrong behavior.

Now you know why I asked for help. I know so little about email
rules. Thanks for explaining this.

This is good news because it greatly simplifies the process.

Do you know if multipart content can be nested? For example, could a
single part of a multipart message itself be multipart? The design
of TMail seems to support this, but again it's easier if that's not
the case.

> I suspect you mean multipart/relative.

I wasn't even aware of that format, to be honest. I knew of
multipart/mixed (which our Usenet host will allow) and multipart/
alternative. What is the purpose of multipart/relative?

>> I thought of trying to convert them all to UTF-8 with Iconv, but I'm
>> not sure what to do if a type doesn't declare a charset or when Iconv
>> chokes on what is declared? Please share your opinions.
>
> Should be defaulted to US-ASCII.

Do you mean that US-ASCII is the charset when one is not specified?

Thanks for all the information.

James Edward Gray II


James Gray

10/29/2007 3:23:00 AM


On Oct 28, 2007, at 6:39 PM, Bill Kelly wrote:

>
> From: "James Edward Gray II" <james@grayproductions.net>
>> 1. If possible, we should gather all text/plain portions of an
>> email and post those with a content-type of text/plain
>
> Do we get many HTML-only messages, having a text/html part, without a
> corresponding text/plain part?

I know I have seen it at least once in the past. I suspect it's
rare, but that's just me guessing. When dealing with the Internet at
large, I think we always need to be prepared for the worst case
scenario.

> Or is that too uncommon to worry about?

You made a good point here that I should try looking at some actual
Ruby Talk messages to see what we're up against. I'll put together a
script to comb through a subset of the archives…

James Edward Gray II

Nobuyoshi Nakada

10/29/2007 4:17:00 AM


Hi,

At Mon, 29 Oct 2007 12:18:40 +0900,
James Edward Gray II wrote in [ruby-talk:276357]:
> >> 1. If possible, we should gather all text/plain portions of an email
> >> and post those with a content-type of text/plain
> >
> > > Rather I want it to be done by FML itself on ruby-lang.org.
>
> > Excellent. Are there any plans to make that happen?

I'm asking eban.

> Do you know if multipart content can be nested? For example, could a
> single part of a multipart message itself be multipart? The design
> of TMail seems to support this, but again it's easier if that's not
> the case.

Yes, and the depth isn't restricted.
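
Since the depth is unbounded, any search for a usable part has to recurse. A minimal sketch, where MimePart is a hypothetical stand-in for a parsed part rather than TMail's class:

```ruby
# Hypothetical stand-in for a parsed MIME part; parts is empty for leaves.
MimePart = Struct.new(:content_type, :body, :parts)

# Depth-first search for the first text/plain leaf, at any nesting depth.
# Returns nil when the message contains no text/plain part at all.
def first_plain_text(part)
  return part if part.content_type == "text/plain"
  part.parts.each do |child|
    found = first_plain_text(child)
    return found if found
  end
  nil
end

nested = MimePart.new("multipart/mixed", nil, [
  MimePart.new("multipart/alternative", nil, [
    MimePart.new("text/html",  "<p>Hi</p>", []),
    MimePart.new("text/plain", "Hi", [])
  ]),
  MimePart.new("image/png", "(binary data)", [])
])

puts first_plain_text(nested).body
```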

> > I suspect you mean multipart/relative.
>
> I wasn't even aware of that format, to be honest. I knew of
> multipart/mixed (which our Usenet host will allow) and multipart/
> alternative. What is the purpose of multipart/relative?

As the above.

> >> I thought of trying to convert them all to UTF-8 with Iconv, but I'm
> >> not sure what to do if a type doesn't declare a charset or when Iconv
> >> chokes on what is declared? Please share your opinions.
> >
> > Should be defaulted to US-ASCII.
>
> Do you mean that US-ASCII is the charset when one is not specified?

RFC 2045 Internet Message Bodies November 1996

5.2. Content-Type Defaults

Default RFC 822 messages without a MIME Content-Type header are taken
by this protocol to be plain text in the US-ASCII character set,
which can be explicitly specified as:

Content-type: text/plain; charset=us-ascii

This default is assumed if no Content-Type header field is specified.
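
In code, that default could be applied along these lines. This is only a sketch: charset_of and DEFAULT_CHARSET are hypothetical names, and the regular expression is a simplified reading of the Content-Type parameter syntax, not a full MIME parser.

```ruby
# Default named in the RFC 2045 excerpt above.
DEFAULT_CHARSET = "us-ascii"

# Hypothetical helper: read the charset parameter from a Content-Type value,
# falling back to the default when the header or the parameter is absent.
def charset_of(content_type_value)
  return DEFAULT_CHARSET if content_type_value.nil?
  match = content_type_value.match(/;\s*charset\s*=\s*"?([^";\s]+)"?/i)
  match ? match[1].downcase : DEFAULT_CHARSET
end

puts charset_of("text/plain;\n\tcharset=US-ASCII;\n\tdelsp=yes;\n\tformat=flowed")
puts charset_of("text/plain; charset=\"UTF-8\"")
puts charset_of(nil)
```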

--
Nobu Nakada

Nobuyoshi Nakada

10/29/2007 4:35:00 AM


Hi,

At Mon, 29 Oct 2007 13:17:24 +0900,
Nobuyoshi Nakada wrote in [ruby-talk:276371]:
> > > I suspect you mean multipart/relative.
> >
> > I wasn't even aware of that format, to be honest. I knew of
> > multipart/mixed (which our Usenet host will allow) and multipart/
> > alternative. What is the purpose of multipart/relative?
>
> As the above.

Oops, it was multipart/related, and I had removed the paragraph
that mentioned it. My mistake, sorry.

--
Nobu Nakada

James Gray

10/29/2007 1:10:00 PM


On Oct 28, 2007, at 11:35 PM, Nobuyoshi Nakada wrote:

> Hi,
>
> At Mon, 29 Oct 2007 13:17:24 +0900,
> Nobuyoshi Nakada wrote in [ruby-talk:276371]:
>>>> I suspect you mean multipart/relative.
>>>
>>> I wasn't even aware of that format, to be honest. I knew of
>>> multipart/mixed (which our Usenet host will allow) and multipart/
>>> alternative. What is the purpose of multipart/relative?
>>
>> As the above.
>
> Oops, it was multipart/related, and I removed the paragraph
> mentioned about it. My mistake, sorry.

I've been looking into this a little this morning.

We do receive multipart/related messages, though they seem fairly
uncommon compared to multipart/alternative. They don't appear to be
gated properly. In fact, the mailing list archives don't even seem
to show them. For example 271796 was a multipart/related message and
I can't find it in the archives or on comp.lang.ruby.

To understand what we are dealing with here, I read:

http://www.faqs.org/rfcs/rf...

This type does not seem easy to deal with and I am open to suggestions
for the best strategy to use.

James Edward Gray II


mortee

10/29/2007 2:20:00 PM


Todd Benson

10/29/2007 3:02:00 PM


On 10/29/07, James Edward Gray II <james@grayproductions.net> wrote:
> On Oct 28, 2007, at 11:35 PM, Nobuyoshi Nakada wrote:
>
> > Hi,
> >
> > At Mon, 29 Oct 2007 13:17:24 +0900,
> > Nobuyoshi Nakada wrote in [ruby-talk:276371]:
> >>>> I suspect you mean multipart/relative.
> >>>
> >>> I wasn't even aware of that format, to be honest. I knew of
> >>> multipart/mixed (which our Usenet host will allow) and multipart/
> >>> alternative. What is the purpose of multipart/relative?
> >>
> >> As the above.
> >
> > Oops, it was multipart/related, and I removed the paragraph
> > mentioned about it. My mistake, sorry.
>
> I've been looking into this a little this morning.
>
> We do receive multipart/related messages, though they seem fairly
> uncommon compared to multipart/alternative. They don't appear to be
> gated properly. In fact, the mailing list archives don't even seem
> to show them. For example 271796 was a multipart/related message and
> I can't find it in the archives or on comp.lang.ruby.
>
> To understand what we are dealing with here, I read:
>
> http://www.faqs.org/rfcs/rf...
>
> This type does not seem easy to deal with and I am open to suggestions
> for the best strategy to use.
>
> James Edward Gray II

I haven't built enough clout in this group for my opinion to matter,
but here goes...

James did a great job with the gateway ... no doubt about that.
Should we even have it? I absolutely think so.

The lowest common denominator for language is US-ASCII (is that a good
thing or bad thing? You decide).

Make sure, James and others, that you label the reformed
emails/postings with some kind of rejoinder that says something to the
effect of "mail/posting has been modified to make it available."

Todd

James Gray

10/29/2007 3:07:00 PM


On Oct 29, 2007, at 9:20 AM, mortee wrote:

> James Edward Gray II wrote:
>> I've been looking into this a little this morning.
>>
>> We do receive multipart/related messages, though they seem fairly
>> uncommon compared to multipart/alternative. They don't appear to be
>> gated properly. In fact, the mailing list archives don't even seem
>> to show them. For example 271796 was a multipart/related message and
>> I can't find it in the archives or on comp.lang.ruby.
>>
>> To understand what we are dealing with here, I read:
>>
>> http://www.faqs.org/rfcs/rf...
>>
>> This type does not seem easy to deal with and I am open to
>> suggestions for the best strategy to use.
>
> AFAIK it's mostly used for HTML messages with images embedded in the
> email itself.

Yeah, I think that's what I'm seeing in my analysis of the messages.

> I guess it would mostly be one part of a multipart/alternative
> message, of which one alternative should be text/plain anyway.

Most of the cases I have found have a multipart/alternative section
inside the multipart/related section, like this example shows:

271796: multipart/related ()
multipart/alternative ()
image/png ()

Obviously I need to extend my statistics gathering script to handle
the nesting, but I've checked this message by hand and there was a
text/plain part in there.
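
A recursive outline of that kind of nesting could be produced along these lines. This is only a sketch: Node is a hypothetical stand-in for a parsed part (a real statistics script would walk TMail objects), and the text/plain and text/html children shown inside the alternative section are assumed for illustration.

```ruby
# Hypothetical stand-in for a parsed MIME part; parts is empty for leaves.
Node = Struct.new(:content_type, :parts)

# Return one line per part, indented two spaces per level of nesting.
def outline(node, depth = 0)
  lines = ["#{'  ' * depth}#{node.content_type}"]
  node.parts.each { |child| lines.concat(outline(child, depth + 1)) }
  lines
end

message = Node.new("multipart/related", [
  Node.new("multipart/alternative", [
    Node.new("text/plain", []),
    Node.new("text/html", [])
  ]),
  Node.new("image/png", [])
])

puts outline(message)
```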

> Otherwise, you're most likely left with HTML to
> strip, and images which you may either drop or attach to the output as
> files.

Right. Which means I still need to settle on an HTML strategy as well.

> Sorry if I happen to be wrong on one point or the other.

The other usage that seems common, more common than the HTML case in
fact, is as part of a signed message:

271822: multipart/signed ()
multipart/related ()
application/pgp-signature ()

I've not yet checked to see if these messages are gated properly with
our current setup.

James Edward Gray II