5.1 Syntactic and Lexical Grammars

5.1.1 Context-Free Grammars

A context-free grammar consists of a number of productions. Each production has an abstract symbol called a nonterminal as its left-hand side, and a sequence of zero or more nonterminal and terminal symbols as its right-hand side. For each grammar, the terminal symbols are drawn from a specified alphabet.

A chain production is a production that has exactly one nonterminal symbol on its right-hand side along with zero or more terminal symbols.

Starting from a sentence consisting of a single distinguished nonterminal, called the goal symbol, a given context-free grammar specifies a language, namely, the (perhaps infinite) set of possible sequences of terminal symbols that can result from repeatedly replacing any nonterminal in the sequence with a right-hand side of a production for which the nonterminal is the left-hand side.

5.1.2 The Lexical and RegExp Grammars

A lexical grammar for ECMAScript is given in clause 12. This grammar has as its terminal symbols Unicode code points that conform to the rules for SourceCharacter defined in 11.1. It defines a set of productions, starting from the goal symbol InputElementDiv, InputElementTemplateTail, or InputElementRegExp, or InputElementRegExpOrTemplateTail, that describe how sequences of such code points are translated into a sequence of input elements.

Input elements other than white space and comments form the terminal symbols for the syntactic grammar for ECMAScript and are called ECMAScript tokens. These tokens are the reserved words, identifiers, literals, and punctuators of the ECMAScript language. Moreover, line terminators, although not considered to be tokens, also become part of the stream of input elements and guide the process of automatic semicolon insertion (12.9). Simple white space and single-line comments are discarded and do not appear in the stream of input elements for the syntactic grammar. A MultiLineComment (that is, a comment of the form /**/ regardless of whether it spans more than one line) is likewise simply discarded if it contains no line terminator; but if a MultiLineComment contains one or more line terminators, then it is replaced by a single line terminator, which becomes part of the stream of input elements for the syntactic grammar.

A RegExp grammar for ECMAScript is given in 22.2.1. This grammar also has as its terminal symbols the code points as defined by SourceCharacter. It defines a set of productions, starting from the goal symbol Pattern, that describe how sequences of code points are translated into regular expression patterns.

Productions of the lexical and RegExp grammars are distinguished by having two colons “::” as separating punctuation. The lexical and RegExp grammars share some productions.

5.1.3 The Numeric String Grammar

Another grammar is used for translating Strings into numeric values. This grammar is similar to the part of the lexical grammar having to do with numeric literals and has as its terminal symbols SourceCharacter. This grammar appears in 7.1.4.1.

Productions of the numeric string grammar are distinguished by having three colons “:::” as punctuation.

5.1.4 The Syntactic Grammar

The syntactic grammar for ECMAScript is given in clauses 13 through 16. This grammar has ECMAScript tokens defined by the lexical grammar as its terminal symbols (5.1.2). It defines a set of productions, starting from two alternative goal symbols Script and Module, that describe how sequences of tokens form syntactically correct independent components of ECMAScript programs.

When a stream of code points is to be parsed as an ECMAScript Script or Module, it is first converted to a stream of input elements by repeated application of the lexical grammar; this stream of input elements is then parsed by a single application of the syntactic grammar. The input stream is syntactically in error if the tokens in the stream of input elements cannot be parsed as a single instance of the goal nonterminal (Script or Module), with no tokens left over.

When a parse is successful, it constructs a parse tree, a rooted tree structure in which each node is a Parse Node. Each Parse Node is an instance of a symbol in the grammar; it represents a span of the source text that can be derived from that symbol. The root node of the parse tree, representing the whole of the source text, is an instance of the parse's goal symbol. When a Parse Node is an instance of a nonterminal, it is also an instance of some production that has that nonterminal as its left-hand side. Moreover, it has zero or more children, one for each symbol on the production's right-hand side: each child is a Parse Node that is an instance of the corresponding symbol.

New Parse Nodes are instantiated for each invocation of the parser and never reused between parses even of identical source text. Parse Nodes are considered the same Parse Node if and only if they represent the same span of source text, are instances of the same grammar symbol, and resulted from the same parser invocation.

Note 1

Parsing the same String multiple times will lead to different Parse Nodes. For example, consider:

let str = "1 + 1;";
eval(str);
eval(str);

Each call to eval converts the value of str into an ECMAScript source text and performs an independent parse that creates its own separate tree of Parse Nodes. The trees are distinct even though each parse operates upon a source text that was derived from the same String value.

Note 2
Parse Nodes are specification artefacts, and implementations are not required to use an analogous data structure.

Productions of the syntactic grammar are distinguished by having just one colon “:” as punctuation.

The syntactic grammar as presented in clauses 13 through 16 is not a complete account of which token sequences are accepted as a correct ECMAScript Script or Module. Certain additional token sequences are also accepted, namely, those that would be described by the grammar if only semicolons were added to the sequence in certain places (such as before line terminator characters). Furthermore, certain token sequences that are described by the grammar are not considered acceptable if a line terminator character appears in certain “awkward” places.

In certain cases, in order to avoid ambiguities, the syntactic grammar uses generalized productions that permit token sequences that do not form a valid ECMAScript Script or Module. For example, this technique is used for object literals and object destructuring patterns. In such cases a more restrictive supplemental grammar is provided that further restricts the acceptable token sequences. Typically, an early error rule will then define an error condition if "P is not covering an N", where P is a Parse Node (an instance of the generalized production) and N is a nonterminal from the supplemental grammar. Here, the sequence of tokens originally matched by P is parsed again using N as the goal symbol. (If N takes grammatical parameters, then they are set to the same values used when P was originally parsed.) An error occurs if the sequence of tokens cannot be parsed as a single instance of N, with no tokens left over. Subsequently, algorithms access the result of the parse using a phrase of the form "the N that is covered by P". This will always be a Parse Node (an instance of N, unique for a given P), since any parsing failure would have been detected by an early error rule.

5.1.5 Grammar Notation

Terminal symbols are shown in fixed width font, both in the productions of the grammars and throughout this specification whenever the text directly refers to such a terminal symbol. These are to appear in a script exactly as written. All terminal symbol code points specified in this way are to be understood as the appropriate Unicode code points from the Basic Latin range, as opposed to any similar-looking code points from other Unicode ranges. A code point in a terminal symbol cannot be expressed by a \ UnicodeEscapeSequence.

Nonterminal symbols are shown in italic type. The definition of a nonterminal (also called a “production”) is introduced by the name of the nonterminal being defined followed by one or more colons. (The number of colons indicates to which grammar the production belongs.) One or more alternative right-hand sides for the nonterminal then follow on succeeding lines. For example, the syntactic definition:

WhileStatement : while ( Expression ) Statement

states that the nonterminal WhileStatement represents the token while, followed by a left parenthesis token, followed by an Expression, followed by a right parenthesis token, followed by a Statement. The occurrences of Expression and Statement are themselves nonterminals. As another example, the syntactic definition:

ArgumentList : AssignmentExpression ArgumentList , AssignmentExpression

states that an ArgumentList may represent either a single AssignmentExpression or an ArgumentList, followed by a comma, followed by an AssignmentExpression. This definition of ArgumentList is recursive, that is, it is defined in terms of itself. The result is that an ArgumentList may contain any positive number of arguments, separated by commas, where each argument expression is an AssignmentExpression. Such recursive definitions of nonterminals are common.

The subscripted suffix “opt”, which may appear after a terminal or nonterminal, indicates an optional symbol. The alternative containing the optional symbol actually specifies two right-hand sides, one that omits the optional element and one that includes it. This means that:

VariableDeclaration : BindingIdentifier Initializeropt

is a convenient abbreviation for:

VariableDeclaration : BindingIdentifier BindingIdentifier Initializer

and that:

ForStatement : for ( LexicalDeclaration Expressionopt ; Expressionopt ) Statement

is a convenient abbreviation for:

ForStatement : for ( LexicalDeclaration ; Expressionopt ) Statement for ( LexicalDeclaration Expression ; Expressionopt ) Statement

which in turn is an abbreviation for:

ForStatement : for ( LexicalDeclaration ; ) Statement for ( LexicalDeclaration ; Expression ) Statement for ( LexicalDeclaration Expression ; ) Statement for ( LexicalDeclaration Expression ; Expression ) Statement

so, in this example, the nonterminal ForStatement actually has four alternative right-hand sides.

A production may be parameterized by a subscripted annotation of the form “[parameters]”, which may appear as a suffix to the nonterminal symbol defined by the production. “parameters” may be either a single name or a comma separated list of names. A parameterized production is shorthand for a set of productions defining all combinations of the parameter names, preceded by an underscore, appended to the parameterized nonterminal symbol. This means that:

StatementList[Return] : ReturnStatement ExpressionStatement

is a convenient abbreviation for:

StatementList : ReturnStatement ExpressionStatement StatementList_Return : ReturnStatement ExpressionStatement

and that:

StatementList[Return, In] : ReturnStatement ExpressionStatement

is an abbreviation for:

StatementList : ReturnStatement ExpressionStatement StatementList_Return : ReturnStatement ExpressionStatement StatementList_In : ReturnStatement ExpressionStatement StatementList_Return_In : ReturnStatement ExpressionStatement

Multiple parameters produce a combinatory number of productions, not all of which are necessarily referenced in a complete grammar.

References to nonterminals on the right-hand side of a production can also be parameterized. For example:

StatementList : ReturnStatement ExpressionStatement[+In]

is equivalent to saying:

StatementList : ReturnStatement ExpressionStatement_In

and:

StatementList : ReturnStatement ExpressionStatement[~In]

is equivalent to:

StatementList : ReturnStatement ExpressionStatement

A nonterminal reference may have both a parameter list and an “opt” suffix. For example:

VariableDeclaration : BindingIdentifier Initializer[+In]opt

is an abbreviation for:

VariableDeclaration : BindingIdentifier BindingIdentifier Initializer_In

Prefixing a parameter name with “?” on a right-hand side nonterminal reference makes that parameter value dependent upon the occurrence of the parameter name on the reference to the current production's left-hand side symbol. For example:

VariableDeclaration[In] : BindingIdentifier Initializer[?In]

is an abbreviation for:

VariableDeclaration : BindingIdentifier Initializer VariableDeclaration_In : BindingIdentifier Initializer_In

If a right-hand side alternative is prefixed with “[+parameter]” that alternative is only available if the named parameter was used in referencing the production's nonterminal symbol. If a right-hand side alternative is prefixed with “[~parameter]” that alternative is only available if the named parameter was not used in referencing the production's nonterminal symbol. This means that:

StatementList[Return] : [+Return]ReturnStatement ExpressionStatement

is an abbreviation for:

StatementList : ExpressionStatement StatementList_Return : ReturnStatement ExpressionStatement

and that:

StatementList[Return] : [~Return]ReturnStatement ExpressionStatement

is an abbreviation for:

StatementList : ReturnStatement ExpressionStatement StatementList_Return : ExpressionStatement

When the words “one of” follow the colon(s) in a grammar definition, they signify that each of the terminal symbols on the following line or lines is an alternative definition. For example, the lexical grammar for ECMAScript contains the production:

NonZeroDigit :: one of 1 2 3 4 5 6 7 8 9

which is merely a convenient abbreviation for:

NonZeroDigit :: 1 2 3 4 5 6 7 8 9

If the phrase “[empty]” appears as the right-hand side of a production, it indicates that the production's right-hand side contains no terminals or nonterminals.

If the phrase “[lookahead = seq]” appears in the right-hand side of a production, it indicates that the production may only be used if the token sequence seq is a prefix of the immediately following input token sequence. Similarly, “[lookahead ∈ set]”, where set is a finite nonempty set of token sequences, indicates that the production may only be used if some element of set is a prefix of the immediately following token sequence. For convenience, the set can also be written as a nonterminal, in which case it represents the set of all token sequences to which that nonterminal could expand. It is considered an editorial error if the nonterminal could expand to infinitely many distinct token sequences.

These conditions may be negated. “[lookahead ≠ seq]” indicates that the containing production may only be used if seq is not a prefix of the immediately following input token sequence, and “[lookahead ∉ set]” indicates that the production may only be used if no element of set is a prefix of the immediately following token sequence.

As an example, given the definitions:

DecimalDigit :: one of 0 1 2 3 4 5 6 7 8 9 DecimalDigits :: DecimalDigit DecimalDigits DecimalDigit

the definition:

LookaheadExample :: n [lookahead ∉ { 1, 3, 5, 7, 9 }] DecimalDigits DecimalDigit [lookahead ∉ DecimalDigit]

matches either the letter n followed by one or more decimal digits the first of which is even, or a decimal digit not followed by another decimal digit.

Note that when these phrases are used in the syntactic grammar, it may not be possible to unambiguously identify the immediately following token sequence because determining later tokens requires knowing which lexical goal symbol to use at later positions. As such, when these are used in the syntactic grammar, it is considered an editorial error for a token sequence seq to appear in a lookahead restriction (including as part of a set of sequences) if the choices of lexical goal symbols to use could change whether or not seq would be a prefix of the resulting token sequence.

If the phrase “[no LineTerminator here]” appears in the right-hand side of a production of the syntactic grammar, it indicates that the production is a restricted production: it may not be used if a LineTerminator occurs in the input stream at the indicated position. For example, the production:

ThrowStatement : throw [no LineTerminator here] Expression ;

indicates that the production may not be used if a LineTerminator occurs in the script between the throw token and the Expression.

Unless the presence of a LineTerminator is forbidden by a restricted production, any number of occurrences of LineTerminator may appear between any two consecutive tokens in the stream of input elements without affecting the syntactic acceptability of the script.

When an alternative in a production of the lexical grammar or the numeric string grammar appears to be a multi-code point token, it represents the sequence of code points that would make up such a token.

The right-hand side of a production may specify that certain expansions are not permitted by using the phrase “but not” and then indicating the expansions to be excluded. For example, the production:

Identifier :: IdentifierName but not ReservedWord

means that the nonterminal Identifier may be replaced by any sequence of code points that could replace IdentifierName provided that the same sequence of code points could not replace ReservedWord.

Finally, a few nonterminal symbols are described by a descriptive phrase in sans-serif type in cases where it would be impractical to list all the alternatives:

SourceCharacter :: any Unicode code point