James Kanze
11/23/2008 1:58:00 PM
On Nov 22, 8:58 pm, Juha Nieminen <nos...@thanks.invalid> wrote:
> doublemaster...@gmail.com wrote:
> > Can we have a puzzle thread here? If anyone has an
> > interesting C++ question which helps to understand C++
> > better or makes interviews easier to face, they can post it here.
> This is not a question which really helps understanding C++
> better nor is a good job interview question (well, not unless
> you are applying for a job which involves writing a compiler),
> but I think it's interesting nevertheless:
> C++ is very hard to parse because its syntax is not a
> so-called context-free grammar. Give an example (one full
> sentence, ie. a full expression ending in a semi-colon) of
> valid code which cannot be unambiguously tokenized properly
> without knowing the environment in which the line of code
> appears (ie. everything else in the same compilation unit). In
> other words, it would be possible to tokenize the sentence in
> at least two completely different ways, and both ways could be
> valid C++ code (if in the proper environment).
> (Note that tokenizing a sentence doesn't require understanding
> the semantics of the expression, ie. it's not necessary to
> know eg. if some type name has been declared earlier or not.
> Tokenizing simply means that the sentence is divided into its
> constituent tokens, each token having a well-defined type, eg.
> "identifier", "unary operator", "binary operator", "opening
> parenthesis", etc.)
The problem with that is that the question is ambiguous: what do
you mean by a token? (As for your "well-defined type", that's a
meaningless statement until you know how the compiler internals
are implemented.)
Formally, C++ defines tokens so that you can always "tokenize"
with at most one character look-ahead (is the next character
part of this token, or not), and no context. In practice,
however, a compiler cannot parse C++ unless it separates symbols
into names of types, names of templates, and everything else,
and I imagine that most compilers treat these as separate tokens.
Similarly, it's probably advantageous to distinguish between the
'>' which closes a template argument list and the '>' which is
the greater-than operator; with the new standard, I suspect that
the simplest implementation would also distinguish between a
'>>' which closes two templates and the right shift operator.
(Formally, '>>' is a single token which is then remapped to two,
but if you know that the context allows the remapping, you can
do it immediately in the tokenizing phase.)
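
To make the point about type names and template names concrete,
here is a minimal sketch. The namespaces and function names are
invented for the illustration; only the statements "a * b;" and
"a < b > c;" matter. They are spelled identically in each context,
but the earlier declarations decide whether they are expressions
or declarations.

```cpp
#include <cassert>

// All names here (value_ctx, type_ctx, template_ctx, f,
// run_examples) are hypothetical, chosen only for this sketch.

namespace value_ctx {
    // a, b and c are variables: both statements are expressions.
    int f() {
        int a = 6, b = 7, c = 0;
        a * b;       // multiplication, result discarded
        a < b > c;   // (a < b) > c, result discarded
        return a * b + (a < b > c);
    }
}

namespace type_ctx {
    typedef int a;   // a now names a type
    int f() {
        int x = 42;
        a * b;       // declaration: b is a pointer to a (int*)
        b = &x;
        return *b;
    }
}

namespace template_ctx {
    template <int N> struct a {   // a now names a template
        int v;
        a() : v(N) {}
    };
    const int b = 3;
    int f() {
        a < b > c;   // declaration: c is an object of type a<3>
        return c.v;
    }
}

// Calls each variant and checks the value it returns.
void run_examples() {
    assert(value_ctx::f() == 43);
    assert(type_ctx::f() == 42);
    assert(template_ctx::f() == 3);
}
```

A compiler will probably warn about the discarded results in
value_ctx; that is expected, since the statements are there only
to show that they are accepted as expressions.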
So formally, there aren't any such sentences, but internally,
there could be, and in fact, there probably are. (In practice, I
would be very surprised if there were any compilers which didn't
use context to return different token types for type names,
template names and other symbols. As long as >> cannot be used
to close two templates, I expect that that's the only case in
most compilers, so presumably, that's what you were looking
for.)
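
As a rough sketch of what "using context to return different
token types" could look like: the toy tokenizer below splits the
text by the longest-match rule, then consults a set of already
declared names to decide what kind of token an identifier is.
Everything in it (TokenKind, Token, Context, tokenize, self_check)
is invented for the illustration and is not how any particular
compiler does it; it handles only identifiers, '>>' and
single-character punctuators.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

enum TokenKind { TypeName, TemplateName, Identifier, Punctuator };

struct Token {
    TokenKind kind;
    std::string text;
};

// Hypothetical stand-in for the compiler's symbol table: the names
// that earlier declarations have introduced as types or templates.
struct Context {
    std::set<std::string> types;
    std::set<std::string> templates;
};

std::vector<Token> tokenize(const std::string& src, const Context& ctx) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char ch = static_cast<unsigned char>(src[i]);
        if (std::isspace(ch)) { ++i; continue; }
        Token t;
        if (std::isalpha(ch) || ch == '_') {
            // Longest match: keep going while the next character
            // can still be part of the identifier.
            std::size_t j = i;
            while (j < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[j])) ||
                    src[j] == '_'))
                ++j;
            t.text = src.substr(i, j - i);
            // The context-dependent step: same spelling, different kind.
            if (ctx.templates.count(t.text)) t.kind = TemplateName;
            else if (ctx.types.count(t.text)) t.kind = TypeName;
            else t.kind = Identifier;
            i = j;
        } else {
            // Anything else becomes a one-character punctuator,
            // except that ">>" is taken as a single token.
            std::size_t len =
                (ch == '>' && i + 1 < src.size() && src[i + 1] == '>') ? 2 : 1;
            t.kind = Punctuator;
            t.text = src.substr(i, len);
            i += len;
        }
        out.push_back(t);
    }
    return out;
}

// Usage: the same text yields different token kinds per context.
void self_check() {
    Context none;                  // nothing declared yet
    Context with_type;
    with_type.types.insert("a");   // as if "typedef int a;" had been seen

    std::vector<Token> t1 = tokenize("a * b;", none);
    std::vector<Token> t2 = tokenize("a * b;", with_type);
    assert(t1.size() == 4 && t2.size() == 4);
    assert(t1[0].kind == Identifier);
    assert(t2[0].kind == TypeName);
    assert(tokenize("x >> y", none)[1].text == ">>");
}
```

The division of the text into tokens never changes here; only the
kind attached to an identifier does, which is the distinction
between the formal and the internal notion of a token.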
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34