Asp Forum - Should HTML entity translation accept "&amp"?

John Nagle

1/7/2008 1:10:00 AM

Another in our ongoing series on "Parsing Real-World HTML".

It's wrong, of course. But Firefox will accept as HTML escapes

&amp
&gt
&lt

as well as the correct forms

&amp;
&gt;
&lt;

To be "compatible", a Python screen scraper at

http://zesty.ca/python...

has a function "htmldecode", which is supposed to recognize
HTML escapes and generate Unicode. (Why isn't this a standard
Python library function? Its inverse is available.)

This uses the regular expression

charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?',re.UNICODE)

to recognize HTML escapes.

Note the ";?", which makes the closing ";" optional.
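To see what the optional ";" does in practice, here is a small sketch that runs the pattern quoted above over a few strings. The helper name find_escapes and the sample strings are made up for illustration; only the pattern comes from the post.

```python
import re

# The pattern quoted above, unchanged.
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?', re.UNICODE)

def find_escapes(text):
    """Return every substring the pattern treats as an HTML escape."""
    return [m.group(0) for m in charrefpat.finditer(text)]

print(find_escapes("a &lt; b"))       # well-formed escape
print(find_escapes("a &lt b"))        # missing ';' still matches
print(find_escapes("page?x=1&y=2"))   # a plain query parameter matches too
```

Because the ";" is optional, any "&" followed by word characters is a candidate escape, including the separator between query parameters.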

This seems fine until we hit something valid but unusual like

http://www.example.com?foo=1&am...

for which "htmldecode" tries to convert "1234567" into
a Unicode character with that decimal number, and gets a
Unicode overflow.
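A rough reproduction of that failure, assuming Python 3 (where chr() plays the role that unichr() had at the time). The sample string is a hypothetical stand-in, not the URL from the post, and to_char is an illustrative helper.

```python
import re
import sys

charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?', re.UNICODE)

# Hypothetical input (not the URL from the post): '&#' plus digits, no ';'.
sample = "x&#1234567"

match = charrefpat.search(sample)
codepoint = int(match.group(2))      # digits captured by the inner group

def to_char(cp):
    """Return the character for cp, or None if cp is not a valid code point."""
    try:
        return chr(cp)
    except (ValueError, OverflowError):
        return None

print(match.group(0))                # text the pattern consumed
print(codepoint > sys.maxunicode)    # is the number beyond the Unicode range?
print(to_char(codepoint))            # None signals the failed conversion
```

The number 1234567 is larger than the highest Unicode code point (0x10FFFF), so the conversion cannot succeed no matter how the escape is terminated.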

For our own purposes, I rewrote "htmldecode" to require a
sequence ending in ";", which means some bogus HTML escapes won't
be recognized, but correct HTML will be processed correctly.
What's general opinion of this behavior? Too strict, or OK?
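For concreteness, a minimal sketch of what a strict decoder along these lines might look like. This is not the actual rewrite, just one way to do it under the assumption of Python 3 and the standard html.entities table; the names strict_charref and htmldecode_strict are invented here.

```python
import re
from html.entities import name2codepoint

# Strict pattern: the trailing ';' is mandatory.
strict_charref = re.compile(r'&(#(\d+|[xX][\da-fA-F]+)|[A-Za-z][A-Za-z0-9]*);')

def htmldecode_strict(text):
    """Decode only ';'-terminated escapes; leave everything else untouched."""
    def replace(match):
        body = match.group(1)
        try:
            if body.startswith('#'):
                num = match.group(2)
                cp = int(num[1:], 16) if num[0] in 'xX' else int(num)
                return chr(cp)
            return chr(name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range number: keep the original text.
            return match.group(0)
    return strict_charref.sub(replace, text)

print(htmldecode_strict("a &lt; b &amp; c"))
print(htmldecode_strict("a &lt b"))
print(htmldecode_strict("x&#1234567;"))
```

With this approach an unterminated "&lt" passes through unchanged, and an out-of-range number is left alone instead of raising.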

John Nagle
SiteTruth

3 Answers

Ben Finney

1/7/2008 1:25:00 AM

John Nagle <nagle@animats.com> writes:

> For our own purposes, I rewrote "htmldecode" to require a sequence
> ending in ";", which means some bogus HTML escapes won't be
> recognized, but correct HTML will be processed correctly. What's
> general opinion of this behavior? Too strict, or OK?

I think it's fine. In the face of ambiguity (and deviation from the
published standards), refuse the temptation to guess.

More specifically, I don't see any reason to contort your code to
understand some non-entity sequence that would be flagged as invalid
by HTML validator tools.

--
\ "Those who write software only for pay should go hurt some |
`\ other field." -- Erik Naggum, in _gnu.misc.discuss_ |
_o__) |
Ben Finney

Steven D'Aprano

1/7/2008 3:56:00 AM

On Mon, 07 Jan 2008 12:25:07 +1100, Ben Finney wrote:

> John Nagle <nagle@animats.com> writes:
>
>> For our own purposes, I rewrote "htmldecode" to require a sequence
>> ending in ";", which means some bogus HTML escapes won't be recognized,
>> but correct HTML will be processed correctly. What's general opinion of
>> this behavior? Too strict, or OK?
>
> I think it's fine. In the face of ambiguity (and deviation from the
> published standards), refuse the temptation to guess.

That's good advice for a library function. But...

> More specifically, I don't see any reason to contort your code to
> understand some non-entity sequence that would be flagged as invalid by
> HTML validator tools.

... it is questionable advice for a program which is designed to make
sense of invalid HTML.

Like it or not, real-world applications sometimes have to work with bad
data. I think we can all agree that the world would have been better off
if the major browsers had followed your advice, but given that they do
not, and thus leave open the opportunity for websites to exist with
invalid HTML, John is left in the painful position of having to write
code that has to make sense of invalid HTML.

I think only John can really answer his own question. What are the
consequences of false positives versus false negatives? If it raises an
exception, can he shunt the code to another function and use some
heuristics to make sense of it, or is it "game over, another site can't
be analyzed"?
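One way the "shunt it to another function" idea could be sketched, assuming Python 3. Everything here is hypothetical: the function names, the choice to raise ValueError, and the heuristic of accepting only unterminated amp, lt and gt are illustrations, not anything John described.

```python
import re
from html.entities import name2codepoint

ESCAPE = re.compile(r'&(#\d+|#[xX][\da-fA-F]+|[A-Za-z][A-Za-z0-9]*)(;?)')

def _to_char(body):
    """Map the text between '&' and ';' to a character, or raise ValueError."""
    if body.startswith('#'):
        digits = body[1:]
        cp = int(digits[1:], 16) if digits[0] in 'xX' else int(digits)
    elif body in name2codepoint:
        cp = name2codepoint[body]
    else:
        raise ValueError('unknown entity name: %r' % body)
    if cp > 0x10FFFF:
        raise ValueError('code point out of range: %d' % cp)
    return chr(cp)

def decode_strict(text):
    """Decode ';'-terminated escapes; raise ValueError on anything doubtful."""
    def replace(match):
        if not match.group(2):
            raise ValueError('unterminated escape: %r' % match.group(0))
        return _to_char(match.group(1))
    return ESCAPE.sub(replace, text)

def decode_lenient(text):
    """Hypothetical heuristic: also accept &amp, &lt, &gt without ';'."""
    def replace(match):
        body, semi = match.group(1), match.group(2)
        if not semi and body not in ('amp', 'lt', 'gt'):
            return match.group(0)
        try:
            return _to_char(body)
        except ValueError:
            return match.group(0)
    return ESCAPE.sub(replace, text)

def decode(text):
    """Try the strict decoder first; fall back to the heuristic on failure."""
    try:
        return decode_strict(text), 'strict'
    except ValueError:
        return decode_lenient(text), 'lenient'

print(decode("a &lt; b"))
print(decode("a &lt b"))
print(decode("page?x=1&y=2"))
```

Returning which path was taken lets the caller count how often the heuristic fires, which is one way to weigh false positives against false negatives on real pages.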

--
Steven

Paddy

1/7/2008 8:14:00 AM

On Jan 7, 1:09 am, John Nagle <na...@animats.com> wrote:
> Another in our ongoing series on "Parsing Real-World HTML".
>
> It's wrong, of course. But Firefox will accept as HTML escapes
>
> &amp
> &gt
> &lt
>
> as well as the correct forms
>
> &amp;
> &gt;
> &lt;
>
> To be "compatible", a Python screen scraper at
>
> http://zesty.ca/python...
>
> has a function "htmldecode", which is supposed to recognize
> HTML escapes and generate Unicode. (Why isn't this a standard
> Python library function? Its inverse is available.)
>
> This uses the regular expression
>
> charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?',re.UNICODE)
>
> to recognize HTML escapes.
>
> Note the ";?", which makes the closing ";" optional.
>
> This seems fine until we hit something valid but unusual like
>
> http://www.example.c...
>
> for which "htmldecode" tries to convert "1234567" into
> a Unicode character with that decimal number, and gets a
> Unicode overflow.
>
> For our own purposes, I rewrote "htmldecode" to require a
> sequence ending in ";", which means some bogus HTML escapes won't
> be recognized, but correct HTML will be processed correctly.
> What's general opinion of this behavior? Too strict, or OK?
>
> John Nagle
> SiteTruth

Maybe htmltidy could help:
http://tidy.source...
?

comp.lang.python
