The C++ grammar is convoluted. These notes offer a quickstart/bird's eye view into it.
Here are some versions of the grammar (or C-variants):
- https://alx71hub.github.io/hcb/
- https://cs.wmich.edu/~gupta/teaching/cs4850/sumII06/The%20syntax%20of%20C%20in%20Backus-Naur%20form.htm
- https://www.open-std.org/jtc1/sc22/wg21/docs/standards#14882
Helpful simplifications:
- https://jimfix.github.io/csci221/lectures/C-grammar-wk01.txt
- Ellis and Strastroup, The Annotated C++ Manual, the Appendix (section 17).
A translation unit is a sequence of declarations.
translation-unit ::= declaration-seq
In Clang's AST, this is the TranslationUnitDecl
.
A declaration sequence is either a single declaration, or a declaration sequence terminating in a declaration:
declaration-seq ::= declaration
| declaration-seq declaration
It's a bit like the functional-programming way to do lists: you have nil and cons, and you push an element onto the head of a list to make it longer. To parse it, you take the head off, and then recursively step into the tail. Here, it's just reverse-cons. Or you can think of the snake pointing in the other direction: the head points to the left and the tail is everything to the left of that last element.
What sorts of things can be declared at the top level? Lots of things. Functions, templates, namespaces, etc. It's C++.
For simplicity, let's just think of the case where you have global variables and functions:
declaration ::= block-declaration | function-definition | ...
We'll cover function definitions in a moment. For the moment, think about top-level block-declarations.
Top-level block-declarations can take many forms. You can declare an alias, namespaces, using-declarations, etc. The easy case is just a simple-declaration:
block-declaration ::= simple-declaration | ...
A simple-declaration allows you to specify an optional list of initial declarations (which are optionally qualified by attribute and declaration specifiers):
simple-declaration ::= [attribute-secifier-seq] [decl-specifier-seq] [init-declarator-list];
Note that you can just have an empty simple-declaration here: just a semicolon!
For simplicity, we'll think about the simple case where you declare
some variables and some types. The types are the decl-specifier-seq
and the variables are the init-declarator-list
.
For the decl-specifier-seq
,this is a reverse-consed sequence of
decl-specifier
s:
decl-specifier-seq ::= decl-specifier | decl-specifier-seq decl-specifier
A decl-specifier can specify the storage class, type, etc:
decl-specifier ::= storage-class-specifier | type-specifier | ...
The type can be specified as a trailing-type-specifier, etc:
type-specifier ::= trailing-type-specifier | ...
And that in turn can be a simple-type, among other things:
trailing-type-speifier ::= simple-type-specifier | ...
Simple types include thinsg like int
, char
, etc.
If you want a list of initial declarations, beyod the types, it should be in the form of a reverse-consed list of init-declarators:
init-declarator-list ::= init-declarator | init-declarator-list init-declarator
What is an init-declarator
? It consists of a declarator, and an
optional initializer:
init-declarator ::= declarator [initializer]
A declarator effectively names a variable or a type or a namespace or whatever it is you're trying to declare. The simple case is to think of it simply as declarating a variable.
However, you can declare a pointer variable, or a non-pointer variable qualified by parameters and a trailing return type.
For simplicity, we can ignore the qualified non-pointer type, and just think about the ptr-declarators:
declarator ::= ptr-declarator | ...
What does this include? It can be a non-pointer declaration, or a non-pointer declaration pre-pended by a pointer operator:
ptr-declarator ::= noptr-declarator
| ptr-operator ptr-declarator
So, for instance, you can do x
, or *x
, or **x
, and so on.
A non-pointer declarator, in turn can just be the declarator's ID, optionally qualified with a sequence of attribute specifiers:
noptr-declarator ::= declarator-id [attribute-specifier-seq] | ...
For simplicity, we can just ignore the attribute specifiers, and think of a noptr-declarator as simply being a declarator-id:
noptr-declarator ::= declarator-id | ...
The declarator-id can be an id-expression, or a reference to a classname. For our purposes, we can just focus on the id-expression:
declarator-id ::= id-expression | ...
An id-expression can be a qualified or unqualified ID:
id-expression ::= unqualified-id | qualified-id
In C++ jargon, the difference here is whether you have some "Foo::" in front of an identifier "x". The "Foo::" qaulifies the name "x" so that "x" is part of the "Foo" namespace. Without that, "x" is just an unqualified name.
For simplicity, we'll focus only on the qualified case. This is just an identifier:
unqualified-id ::= identifier | ...
Example. Suppose you have a global declaration in your file:
int foo;
Here's the production/expansion:
declaration ::=> block-declaration
block-declaration ::=> simple-declaration
simple-declaration ::=> <a>decl-specifier-seq <b>init-declarator-list;
<a>decl-specifier-seq ::=> decl-specifier
decl-specifier ::=> type-specifier
type-specifier ::=> trailing-type-specifier
trailing-type-specifier ::=> simple-type-specifier
simple-type-specifier ::=> "int"
<b>init-declarator-list ::=> init-declarator
init-declarator ::=> declarator [initializer]
declarator ::=> ptr-declarator
ptr-declarator ::=> noptr-declarator
noptr-declarator ::=> declarator-id
declarator-id ::=> id-expression
id-expression ::=> unqualified-id
unqualified-id ::=> identifier
identifier ::=> "foo"
You can also provide an optional initializer for a top level
declaration like this. For instance, adding 3 + 2
as an
initializer of foo
:
int foo = 3 + 2;
An initializer can be a brace-or-equal-initializer, or a list of expressions.
initializer ::= brace-or-equal-initializer
| ( expression-list )
For simplicity, we'll just focus on the brace-or-equal-initializer. This can either be the braced initial value, or an equals sign followed by an initializer clause:
brace-or-equal-initializer ::= = initializer-clause
| braced-init-list
The initializer-clause is, in turn, an assignment-expression:
initializer-clause ::= assigment-expression | ...
There are examples of assignment-expressions below.
In Clang's AST, these top-level init-declarators will have the form
of SomethingDecl
children of the parent TranslationUnitDecl
.
In Clang's AST, a top-level function-definition
is a top-level FunctionDecl
,
i.e., a FunctionDecl
that is a child of the TranslationUnitDecl
.
A function definition can take optional attribute specifiers and decl specifiers, along with a declarator to name the function and a function body.
function-definition ::= [attribute-specifier-seq] [decl-specifier-seq] declarator function-body
The decl-specifiers involve stipulating the return type of the function, among other things, and we saw an example of it in the last section. The declarotor is also just an identifier, and we saw an example of that in the last section too. The function body of a function involves an optional ctor-initializer, and then a compound-statement for the body of the function:
function-body ::= [ctor-initializer] compound-statement
We can ignore the ctor-initializer for simplicity, and instead just think of a function as having a function body.
The body of a function is a compound-statement. This is a sequence of zero or more statements between curly braces:
compound-statement ::= { [statement-seq] }
In Clang's AST, the function body is a CompoundStmtDecl
, and its
children are the list of statements.
A statement sequence is of course a reverse-consed sequence of statements:
statement-seq ::= statement | statement-seq statement
A statement can be an if-statement (which C++ calls a selection-statement), a while-loop or for-loop (which C++ calls iteration statements), a jump-statement, another compound statement, an expression statement, a manually labeled statement (a target of jumps/gotos), a try block, and more declarations:
statement ::= declaration-statement
| labeled-statement
| selection-statement
| iteration-statement
| jump-statement
| compound-statement
| expression-statement
| try-block
Note that C++ does not have an assignment statement. Instead, it just has expression statements, whose contents are just expressions (and even the expression can be optional):
expression-statement ::= [expression]
Since an expression can be optional, a "skip statement" is
really just an empty expression statement in C++. In Clang's AST,
it is a NullStmt
.
Expressions can take many forms. One version of an expression is an assignment. Other examples involve arithmetic expressions, boolean expressions, or even just a character or string literal.
Here are some helpful grammar rules from the Annotated Manual's appendix:
name ::= identifier
literal ::= integer-constant
primary-expression ::= literal | name
postfix-expression ::= primary-expression | postfix-expression++
unary-operator ::= * | & | + | - | ! | ~
cast-expression ::= unary-expression
| ( type-name ) cast-expression
unary-expression ::= postfix-expression
| ++unary-expression
| sizeof unary-expression
| sizeof ( type-name )
| unary-operator cast-expression
pm-expression ::= cast-expression | ...
multiplicative-expression ::= pm-expression
| multiplicative-expression * pm-expression
additive-expression ::= multiplicative-expression
| additive-expression + multiplicative expression
shift-expression ::= additive-expression
| shift-expression << additive-expression
relational-expression ::= shift-expression
| relational-expression < shift-expression
equality-expression ::= relational-expression
| equality-expression == relational-expression
and-expression ::= equality-expression | ...
exclusive-or-expression ::= and-expression | ...
inclusive-or-expression ::= exclusive-or-expression | ...
logical-and-expression ::= inclusive-or-expression | ...
logical-or-expression ::= logical-and-expression | ...
conditional-expression ::= logical-or-expression | ...
assignment-operator ::= = | *= | +=
assignment-expression
::= conditional-expression
| unary-expression assignment-operator assignment-expression
expression ::= assignment-expression
| expression , assignment-expression
For an example, suppose I have a statement like this:
x = 3;
Here's the production:
expression ::=> assignment-expression
assignment-expression ::=>
<a>unary-expression <b>assignment-operator <c>assignment-expression
<a>unary-expression ::=> postfix-expression
postfix-expression ::=> primary-expression
primary-expression ::=> name
name ::=> identifier
identifier ::=> "x"
<b>assignment-operator ::=> "="
<c>assignment-expression ::=> conditional-expression
conditional-expression ::=> logical-or-expression
logical-or-expression ::=> logical-and-expression
logical-and-expression ::=> inclusive-or-expression
inclusive-or-expression ::=> exclusive-or-expression
exclusive-or-expression ::=> and-expression
and-expression ::=> equality-expression
equality-expression ::=> relational-expression
shift-expression ::=> additive expression
additive-expression ::=> multiplicative-expression
multiplicative-expression ::=> pm-expression
pm-expression ::=> cast-expression
cast-expression ::=> unary-expression
unary-expression ::=> postfix-expression
postfix-expression ::=> primary-expression
primary-expression ::=> literal
literal ::=> integer-literal
integer-literal ::= "3"