Chapter 2 The Big Picture
2.1 Let's Get Meta!
- interpreter
if an application computes or “executes” sentences, we call that
application an interpreter. Examples include calculators, configuration file
readers, and Python interpreters. - translator
If we’re converting sentences from one language to another, we call that application a translator. Examples include Java to C# converters and compilers. - parsers or syntax analyzers
Programs that recognize (identify the various components in a phrase and differentiate it from other phrases) languages. - Two stages of parsing
- lexical analysis or tokenizing (lexer)
The process of grouping characters into words or symbols (tokens) - tokens
Tokens consist of at least two pieces of information: the token type (identifying the lexical structure) and the text matched for that token by the lexer. - parser
The second stage is the actual parser and feeds off of these tokens to recognize
the sentence structure.
By default, ANTLR-generated parsers build a data structure called a parse tree or syntax
tree that records how the parser recognized the structure of the input sentence
and its component phrases.
- lexical analysis or tokenizing (lexer)
We would specify phrase structure with a set of rules. Here is the grammar rule that corresponds to the first level of the assign subtree from the diagram":
assign : ID '=' expr ';' ; // match an assignment statement like "sp = 100;"
2.2 Implementing Parsers (之后有时间重新读一遍)
The ANTLR tool generates recursive-descent parsers from grammar rules.
A more general term for this kind of parsing is top-down parsing; recursive-descent parsers are just one kind of top-down parser implementation.
The stat symbol means calling method stat() for the parse tree. To get an idea of what recursive-descent parsers look like, here’s the (slightly cleaned up) method that ANTLR generates for rule assign:
// assign : ID '=' expr ';' ;
void assign() { // method generated from rule assign
match(ID); // compare ID to current input symbol then consume
match('=');
expr(); // match an expression by calling expr()
match(';');
}
For example, the stat rule that invokes assign
likely has a list of other kinds of statements.
/** Match any kind of statement starting at the current input position */
stat: assign // First alternative ('|' is alternative separator)
| ifstat // Second alternative
| whilestat
...
;
A parsing rule for stat looks like a switch.
void stat() {
switch ( «current input token» ) {
CASE ID : assign(); break;
CASE IF : ifstat(); break; // IF is token type for keyword 'if'
CASE WHILE : whilestat(); break;
...
default : «raise no viable alternative exception»
}
}
2.3 You Can't Put Too Much Water into a Nuclear Reactor (ambiguous phrase)
We have to provide unambiguous grammars.
- duplicate is an obvious ambiguities
- some more subtle ambiguities:
stat: expr ';' // expression statement
| ID '(' ')' ';' // function call statement
;
expr: ID '(' ')'
| INT
;
Here are the two interpretations of input f(); starting in rule stat:
Ambiguities can occur in the lexer as well as the parser, but ANTLR resolves
them so the rules behave naturally
BEGIN : 'begin' ; // match b-e-g-i-n sequence; ambiguity resolves to BEGIN
ID : [a-z]+ ; // match one or more of any lowercase letter
that input beginner would match only to rule ID. The lexer would not match beginner as BEGIN followed by an ID matching input ner.